1 00:00:09,679 --> 00:00:13,891 - Hello? Okay, it's after 12, so I want to get started. 2 00:00:13,891 --> 00:00:17,822 So today, lecture eight, we're going to talk about deep learning software. 3 00:00:17,822 --> 00:00:21,283 This is a super exciting topic because it changes a lot every year. 4 00:00:21,283 --> 00:00:25,621 But it also means it's a lot of work to give this lecture 'cause it changes a lot every year. 5 00:00:25,621 --> 00:00:30,024 But as usual, a couple administrative notes before we dive into the material. 6 00:00:30,024 --> 00:00:34,563 So as a reminder, the project proposals for your course projects were due on Tuesday. 7 00:00:34,563 --> 00:00:42,766 So hopefully you all turned that in, and hopefully you all have a somewhat good idea of what kind of projects you want to work on for the class. 8 00:00:42,766 --> 00:00:50,217 So we're in the process of assigning TAs to projects based on what the project area is and the expertise of the TAs. 9 00:00:50,217 --> 00:00:54,264 So we'll have some more information about that in the next couple days I think. 10 00:00:54,264 --> 00:00:56,563 We're also in the process of grading assignment one, 11 00:00:56,563 --> 00:01:00,942 so stay tuned and we'll get those grades back to you as soon as we can. 12 00:01:00,942 --> 00:01:08,680 Another reminder is that assignment two has been out for a while. That's going to be due next week, a week from today, Thursday. 13 00:01:08,680 --> 00:01:16,231 And again, when working on assignment two, remember to stop your Google Cloud instances when you're not working to try to preserve your credits. 14 00:01:16,231 --> 00:01:24,812 And another point of confusion I just wanted to re-emphasize is that for assignment two you really only need to use GPU instances for the last notebook. 15 00:01:24,812 --> 00:01:32,250 For all of the other notebooks it's just Python and Numpy, so you don't need any GPUs for those questions. 16 00:01:32,250 --> 00:01:36,701 So again, conserve your credits, only use GPUs when you need them. 17 00:01:36,701 --> 00:01:39,973 And the final reminder is that the midterm is coming up. 18 00:01:39,973 --> 00:01:45,683 It's kind of hard to believe we're there already, but the midterm will be in class on Tuesday, May 9th. 19 00:01:45,683 --> 00:01:47,901 So the midterm will be more theoretical. 20 00:01:47,901 --> 00:01:57,071 It'll be sort of pen and paper, working through different kinds of slightly more theoretical questions to check your understanding of the material that we've covered so far. 21 00:01:57,071 --> 00:02:02,506 And I think we'll probably post at least a short sort of sample of the types of questions to expect. 22 00:02:02,506 --> 00:02:03,695 Question? 23 00:02:03,695 --> 00:02:05,310 [student's words obscured due to lack of microphone] 24 00:02:05,310 --> 00:02:10,675 Oh yeah, the question is whether it's open-book, so we're going to say closed note, closed book. 25 00:02:10,675 --> 00:02:15,671 Yeah, so that's what we've done in the past, just closed note, closed book; we 26 00:02:15,671 --> 00:02:21,735 just want to check that you understand the intuition behind most of the stuff we've presented. 27 00:02:23,618 --> 00:02:27,577 So, a quick recap as a reminder of what we were talking about last time. 28 00:02:27,577 --> 00:02:29,737 Last time we talked about fancier optimization algorithms 29 00:02:29,737 --> 00:02:34,975 for deep learning models including SGD Momentum, Nesterov, RMSProp and Adam.
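(As a refresher, a minimal sketch of one of those tweaks in Numpy; this is an illustrative SGD-with-momentum update, not code from the slides, and the variable names are made up.)

```python
import numpy as np

# Sketch of SGD with momentum (illustrative, not from the slides).
# w: weights, dw: gradient of the loss w.r.t. w, v: velocity
def sgd_momentum_step(w, dw, v, learning_rate=1e-3, rho=0.9):
    v = rho * v + dw              # accumulate a running "velocity" of gradients
    w = w - learning_rate * v     # step in the direction of the velocity
    return w, v

w = np.random.randn(10)
v = np.zeros_like(w)
dw = np.random.randn(10)          # stand-in for a real gradient
w, v = sgd_momentum_step(w, dw, v)
```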
30 00:02:34,975 --> 00:02:45,492 And we saw that these relatively small tweaks on top of vanilla SGD are relatively easy to implement but can make your networks converge a bit faster. 31 00:02:45,492 --> 00:02:48,529 We also talked about regularization, especially dropout. 32 00:02:48,529 --> 00:02:56,975 So remember dropout, you're kind of randomly setting parts of the network to zero during the forward pass, and then you kind of marginalize out over that noise at test time. 33 00:02:56,975 --> 00:03:02,805 And we saw that this was kind of a general pattern across many different types of regularization in deep learning, where you might add some kind 34 00:03:02,805 --> 00:03:08,415 of noise during training, but then marginalize out that noise at test time so it's not stochastic at test time. 35 00:03:08,415 --> 00:03:15,376 We also talked about transfer learning where you can maybe download big networks that were pre-trained on some dataset and then fine-tune them for your own problem. 36 00:03:15,376 --> 00:03:21,314 And this is one way that you can attack a lot of problems in deep learning, even if you don't have a huge dataset of your own. 37 00:03:22,781 --> 00:03:29,615 So today we're going to shift gears a little bit and talk about some of the nuts and bolts of writing software and how the hardware works. 38 00:03:29,615 --> 00:03:36,276 And a little bit, diving into a lot of details about what the software looks like that you actually use to train these things in practice. 39 00:03:36,276 --> 00:03:43,967 So we'll talk a little bit about CPUs and GPUs and then we'll talk about several of the major deep learning frameworks that are out there in use these days. 40 00:03:45,471 --> 00:03:52,961 So first, we've sort of mentioned this offhand a bunch of different times, that computers have CPUs, computers have GPUs. 41 00:03:52,961 --> 00:04:02,655 Deep learning uses GPUs, but we weren't really too explicit up to this point about what exactly these things are and why one might be better than another for different tasks. 42 00:04:02,655 --> 00:04:06,472 So, who's built a computer before? Just kind of show of hands. 43 00:04:06,472 --> 00:04:10,965 So, maybe about a third of you, half of you, somewhere around that ballpark. 44 00:04:10,965 --> 00:04:15,174 So this is a shot of my computer at home that I built. 45 00:04:15,174 --> 00:04:22,261 And you can see that there's a lot of stuff going on inside the computer, maybe, hopefully you know what most of these parts are. 46 00:04:22,261 --> 00:04:25,594 And the CPU is the Central Processing Unit. 47 00:04:25,594 --> 00:04:31,391 That's this little chip hidden under this cooling fan right here near the top of the case. 48 00:04:31,391 --> 00:04:39,555 And the CPU is actually a relatively small piece. It's a relatively small thing inside the case. It's not taking up a lot of space. 49 00:04:39,555 --> 00:04:46,221 And the GPUs are these two big monster things that are taking up a gigantic amount of space in the case. 50 00:04:46,221 --> 00:04:50,296 They have their own cooling, they're taking a lot of power. They're quite large. 51 00:04:50,296 --> 00:04:59,139 So, just in terms of how much power they're using, in terms of how big they are, the GPUs are kind of physically imposing and taking up a lot of space in the case. 52 00:04:59,139 --> 00:05:04,516 So the question is what are these things and why are they so important for deep learning?
53 00:05:04,516 --> 00:05:08,937 Well, the GPU is called a graphics card, or Graphics Processing Unit. 54 00:05:08,937 --> 00:05:16,166 And these were really developed originally for rendering computer graphics, and especially around games and that sort of thing. 55 00:05:16,166 --> 00:05:23,247 So another show of hands, who plays video games at home sometimes, from time to time on their computer? 56 00:05:23,247 --> 00:05:25,693 Yeah, so again, maybe about half, good fraction. 57 00:05:25,693 --> 00:05:32,196 So for those of you who've played video games before and who've built your own computers, you probably have your own opinions on this debate. 58 00:05:32,196 --> 00:05:34,095 [laughs] 59 00:05:34,095 --> 00:05:37,666 So this is one of those big debates in computer science. 60 00:05:37,666 --> 00:05:42,620 You know, there's like Intel versus AMD for CPUs, NVIDIA versus AMD for graphics cards. 61 00:05:42,620 --> 00:05:45,394 It's up there with Vim versus Emacs for text editors. 62 00:05:45,394 --> 00:05:51,945 And pretty much any gamer has their own opinions on which of these two sides they prefer for their own cards. 63 00:05:51,945 --> 00:05:59,116 And in deep learning we kind of have mostly picked one side of this fight, and that's NVIDIA. 64 00:05:59,116 --> 00:06:05,117 So if you guys have AMD cards, you might be in a little bit more trouble if you want to use those for deep learning. 65 00:06:05,117 --> 00:06:08,812 And really, NVIDIA's been pushing a lot for deep learning in the last several years. 66 00:06:08,812 --> 00:06:11,997 It's been kind of a large focus of some of their strategy. 67 00:06:11,997 --> 00:06:19,354 And they put in a lot of effort into engineering sort of good solutions to make their hardware better suited for deep learning. 68 00:06:19,354 --> 00:06:27,718 So most people in deep learning when we talk about GPUs, we're pretty much exclusively talking about NVIDIA GPUs. 69 00:06:27,718 --> 00:06:35,268 Maybe in the future this'll change a little bit, and there might be new players coming up, but at least for now NVIDIA is pretty dominant. 70 00:06:35,268 --> 00:06:41,705 So to give you an idea of like what is the difference between a CPU and a GPU, I've kind of made a little spreadsheet here. 71 00:06:41,705 --> 00:06:52,079 On the top we have two of the kind of top end Intel consumer CPUs, and on the bottom we have two of NVIDIA's sort of current top end consumer GPUs. 72 00:06:52,079 --> 00:06:55,975 And there's a couple general trends to notice here. 73 00:06:55,975 --> 00:07:03,284 Both GPUs and CPUs are kind of general purpose computing machines where they can execute programs and do sort of arbitrary instructions, 74 00:07:03,284 --> 00:07:05,987 but they're qualitatively pretty different. 75 00:07:05,987 --> 00:07:16,714 So CPUs tend to have just a few cores, for consumer desktop CPUs these days, they might have something like four or six or maybe up to 10 cores. 76 00:07:16,714 --> 00:07:24,893 With hyperthreading technology that means they can run, the hardware can physically run, like maybe eight or up to 20 threads concurrently. 77 00:07:24,893 --> 00:07:29,700 So the CPU can maybe do 20 things in parallel at once. 78 00:07:29,700 --> 00:07:34,527 So that's just not a gigantic number, but those threads for a CPU are pretty powerful. 79 00:07:34,527 --> 00:07:37,223 They can actually do a lot of things, they're very fast. 80 00:07:37,223 --> 00:07:43,011 Every CPU instruction can actually do quite a lot of stuff. And they can all work pretty independently.
81 00:07:43,011 --> 00:07:51,909 For GPUs it's a little bit different. So for GPUs we see that these sort of common top end consumer GPUs have thousands of cores. 82 00:07:51,909 --> 00:08:00,412 So the NVIDIA Titan XP which is the current top of the line consumer GPU has 3840 cores. So that's a crazy number. 83 00:08:02,223 --> 00:08:06,357 That's like way more than the 10 cores that you'll get for a similarly priced CPU. 84 00:08:06,357 --> 00:08:12,207 The downside of a GPU is that each of those cores, one, runs at a much slower clock speed. 85 00:08:12,207 --> 00:08:14,439 And two, they really can't do quite as much. 86 00:08:14,439 --> 00:08:19,680 You can't really compare CPU cores and GPU cores apples to apples. 87 00:08:19,680 --> 00:08:22,510 The GPU cores can't really operate very independently. 88 00:08:22,510 --> 00:08:29,297 They all kind of need to work together and sort of parallelize one task across many cores rather than each core totally doing its own thing. 89 00:08:29,297 --> 00:08:32,405 So you can't really compare these numbers directly. 90 00:08:32,405 --> 00:08:41,370 But it should give you the sense that due to the large number of cores, GPUs are really good for parallel things where you need to do a lot of things all at the same time, 91 00:08:41,370 --> 00:08:44,742 but those things are all pretty much the same flavor. 92 00:08:44,742 --> 00:08:49,387 Another thing to point out between CPUs and GPUs is this idea of memory. 93 00:08:49,387 --> 00:08:58,523 Right, so CPUs have some cache on the CPU, but that's relatively small and the majority of the memory for your CPU is pulled from your 94 00:08:58,523 --> 00:09:06,589 system memory, the RAM, which will maybe be like eight, 12, 16, 32 gigabytes of RAM on a typical consumer desktop these days. 95 00:09:06,589 --> 00:09:10,646 Whereas GPUs actually have their own RAM built into the chip. 96 00:09:12,055 --> 00:09:22,675 There's a pretty large bottleneck communicating between the RAM in your system and the GPU, so the GPUs typically have their own relatively large block of memory within the card itself. 97 00:09:23,955 --> 00:09:33,481 And for the Titan XP, which again is maybe the current top of the line consumer card, this thing has 12 gigabytes of memory local to the GPU. 98 00:09:33,481 --> 00:09:41,790 GPUs also have their own caching system where there are sort of multiple hierarchies of caching between the 12 gigabytes of GPU memory and the actual GPU cores. 99 00:09:41,790 --> 00:09:46,908 And that's somewhat similar to the caching hierarchy that you might see in a CPU. 100 00:09:47,985 --> 00:09:52,583 So, CPUs are kind of good for general purpose processing. They can do a lot of different things. 101 00:09:52,583 --> 00:09:57,089 And GPUs are maybe more specialized for these highly parallelizable algorithms. 102 00:09:57,089 --> 00:10:04,106 So the prototypical algorithm of something that works really really well and is like perfectly suited to a GPU is matrix multiplication. 103 00:10:04,106 --> 00:10:14,348 So remember in matrix multiplication on the left we've got like a matrix composed of a bunch of rows. We multiply that on the right by another matrix composed of a bunch of columns and then this produces 104 00:10:14,348 --> 00:10:25,009 another, a final matrix where each element in the output matrix is a dot product between one of the rows and one of the columns of the two input matrices. And these dot products are all independent.
105 00:10:25,009 --> 00:10:33,653 Like you could imagine, for this output matrix you could split it up completely and have each of those different elements of the output matrix all being computed in parallel 106 00:10:33,653 --> 00:10:38,289 and they all sort of are running the same computation which is taking a dot product of these two vectors. 107 00:10:38,289 --> 00:10:44,177 But exactly where they're reading that data from is from different places in the two input matrices. 108 00:10:44,177 --> 00:10:55,166 So you could imagine that for a GPU you can just like blast this out and have all of these elements of the output matrix all computed in parallel, and that could make this thing compute super fast on a GPU. 109 00:10:55,166 --> 00:11:04,940 So that's kind of the prototypical type of problem where a GPU is really well suited, where a CPU might have to go in and step through sequentially and compute each of these elements one by one. 110 00:11:06,337 --> 00:11:13,829 That picture is a little bit of a caricature because CPUs these days have multiple cores, they can do vectorized instructions as well, 111 00:11:13,829 --> 00:11:19,568 but still, for these like massively parallel problems GPUs tend to have much better throughput. 112 00:11:19,568 --> 00:11:25,404 Especially when these matrices get really really big. And by the way, convolution is kind of the same kind of story. 113 00:11:25,404 --> 00:11:36,359 Where you know in convolution we have this input tensor, we have this weight tensor and then every point in the output tensor after a convolution is again some inner product between some part of the weights and some part of the input. 114 00:11:36,359 --> 00:11:43,354 And you can imagine that a GPU could really parallelize this computation, split it all up across the many cores and compute it very quickly. 115 00:11:43,354 --> 00:11:49,510 So that's kind of the general flavor of the types of problems where GPUs give you a huge speed advantage over CPUs. 116 00:11:51,695 --> 00:11:55,498 So you can actually write programs that run directly on GPUs. 117 00:11:55,498 --> 00:12:03,614 So NVIDIA has this CUDA abstraction that lets you write code that kind of looks like C, but executes directly on the GPUs. 118 00:12:03,614 --> 00:12:05,484 But CUDA code is really really tricky. 119 00:12:05,484 --> 00:12:12,002 It's actually really tough to write CUDA code that's performant and actually squeezes all the juice out of these GPUs. 120 00:12:12,002 --> 00:12:19,163 You have to be very careful managing the memory hierarchy and making sure you don't have cache misses and branch mispredictions and all that sort of stuff. 121 00:12:19,163 --> 00:12:22,930 So it's actually really really hard to write performant CUDA code on your own. 122 00:12:22,930 --> 00:12:32,537 So as a result NVIDIA has released a lot of libraries that implement common computational primitives that are very very highly optimized for GPUs. 123 00:12:32,537 --> 00:12:40,610 So for example NVIDIA has a cuBLAS library that implements different kinds of matrix multiplications and different matrix operations that are super optimized, 124 00:12:40,610 --> 00:12:46,438 run really well on GPU, get very close to sort of theoretical peak hardware utilization.
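(In practice you usually hit those optimized routines through a framework rather than calling them directly; as a rough sketch, assuming you have PyTorch with CUDA available, the same matrix multiply on CUDA tensors will dispatch to NVIDIA's optimized GPU kernels under the hood.)

```python
import torch

# Rough sketch: the same matrix multiply on CPU vs. GPU.
# On a CUDA device this call ends up dispatching to NVIDIA's optimized
# GPU libraries (cuBLAS) under the hood; you never call cuBLAS yourself.
A = torch.randn(4096, 4096)
B = torch.randn(4096, 4096)

C_cpu = A.mm(B)                   # runs on the CPU

if torch.cuda.is_available():     # only if a CUDA GPU is present
    A_gpu, B_gpu = A.cuda(), B.cuda()
    C_gpu = A_gpu.mm(B_gpu)       # runs on the GPU
```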
125 00:12:46,438 --> 00:12:54,499 Similarly they have a cuDNN library which implements things like convolution, forward and backward passes, batch normalization, recurrent networks, 126 00:12:54,499 --> 00:12:57,454 all these kinds of computational primitives that we need in deep learning. 127 00:12:57,454 --> 00:13:03,842 NVIDIA has gone in there and released their own binaries that compute these primitives very efficiently on NVIDIA hardware. 128 00:13:03,842 --> 00:13:09,624 So in practice, you tend not to end up writing your own CUDA code for deep learning. 129 00:13:09,624 --> 00:13:14,173 You typically are just mostly calling into existing code that other people have written. 130 00:13:14,173 --> 00:13:19,573 Much of which is the stuff which has been heavily optimized by NVIDIA already. 131 00:13:19,573 --> 00:13:23,693 There's another sort of language called OpenCL which is a bit more general. 132 00:13:23,693 --> 00:13:29,185 Runs on more than just NVIDIA GPUs, can run on AMD hardware, can run on CPUs, 133 00:13:29,185 --> 00:13:43,938 but with OpenCL, nobody's really spent a large amount of effort and energy trying to get optimized deep learning primitives, so it tends to be a lot less performant than the super optimized versions in CUDA. 134 00:13:43,938 --> 00:13:51,839 So maybe in the future we might see a more open standard and we might see this across many more types of platforms, but at least for now, 135 00:13:51,839 --> 00:13:55,488 NVIDIA's kind of the main game in town for deep learning. 136 00:13:55,488 --> 00:14:01,686 So you can check, there's a lot of different resources for learning about how you can do GPU programming yourself. It's kind of fun. 137 00:14:01,686 --> 00:14:05,900 It's sort of a different paradigm of writing code because it's this massively parallel architecture, 138 00:14:05,900 --> 00:14:08,023 but that's a bit beyond the scope of this course. 139 00:14:08,023 --> 00:14:12,263 And again, you don't really need to write your own CUDA code much in practice for deep learning. 140 00:14:12,263 --> 00:14:16,600 And in fact, I've never written my own CUDA code for any research project, so, 141 00:14:16,600 --> 00:14:22,219 but it is kind of useful to know like how it works and what are the basic ideas even if you're not writing it yourself. 142 00:14:23,488 --> 00:14:29,168 So if you want to look at kind of CPU GPU performance in practice, I did some benchmarks last summer 143 00:14:29,168 --> 00:14:36,065 comparing a decent Intel CPU against a bunch of different GPUs that were sort of near top of the line at that time. 144 00:14:38,747 --> 00:14:48,954 And these were my own benchmarks that you can find more details about on GitHub, but my findings were that for things like VGG 16 and 19, ResNets, various ResNets, 145 00:14:49,830 --> 00:14:57,114 then you typically see something like a 65 to 75 times speed up when running the exact same computation 146 00:14:57,114 --> 00:15:00,984 on a top of the line GPU, in this case a Pascal Titan X, 147 00:15:00,984 --> 00:15:08,604 versus a top of the line, well, not quite top of the line CPU, which in this case was an Intel E5 processor. 148 00:15:08,604 --> 00:15:15,550 Although, one sort of caveat here is that you always need to be super careful whenever you're reading any kind of benchmarks 149 00:15:15,550 --> 00:15:20,103 about deep learning, because it's super easy to be unfair between different things.
150 00:15:20,103 --> 00:15:26,339 And you kind of need to know a lot of the details about what exactly is being benchmarked in order to know whether or not the comparison is fair. 151 00:15:26,339 --> 00:15:35,855 So in this case I'll come right out and tell you that probably this comparison is a little bit unfair to CPU because I didn't spend a lot of effort 152 00:15:35,855 --> 00:15:38,721 trying to squeeze the maximal performance out of CPUs. 153 00:15:38,721 --> 00:15:42,483 I probably could have tuned the BLAS libraries better for the CPU performance. 154 00:15:42,483 --> 00:15:44,540 And I probably could have gotten these numbers a bit better. 155 00:15:44,540 --> 00:15:51,964 This was sort of out of the box performance between just installing Torch, running it on a CPU, just installing Torch running it on a GPU. 156 00:15:51,964 --> 00:15:57,872 So this is kind of out of the box performance, but it's not really like peak, possible, theoretical throughput on the CPU. 157 00:15:57,872 --> 00:16:02,422 But that being said, I think there are still pretty substantial speed ups to be had here. 158 00:16:02,422 --> 00:16:15,543 Another kind of interesting outcome from this benchmarking was comparing these optimized cuDNN libraries from NVIDIA for convolution and whatnot versus sort of more naive CUDA that had been hand written 159 00:16:15,543 --> 00:16:17,623 out in the open source community. 160 00:16:17,623 --> 00:16:24,653 And you can see that if you compare the same networks on the same hardware with the same deep learning framework and the only difference is swapping out 161 00:16:24,653 --> 00:16:37,442 these cuDNN versus sort of hand written, less optimized CUDA you can see something like nearly a three X speed up across the board when you switch from the relatively simple CUDA to these like super optimized cuDNN implementations. 162 00:16:37,442 --> 00:16:45,202 So in general, whenever you're writing code on GPU, you should probably almost always like just make sure you're using cuDNN because you're leaving probably 163 00:16:45,202 --> 00:16:51,602 a three X performance boost on the table if you're not calling into cuDNN for your stuff. 164 00:16:51,602 --> 00:17:02,882 So another problem that comes up in practice, when you're training these things is that you know, your model is maybe sitting on the GPU, the weights of the model are in that 12 gigabytes of local storage on the GPU, but your big dataset 165 00:17:02,882 --> 00:17:07,243 is sitting over on the right on a hard drive or an SSD or something like that. 166 00:17:07,243 --> 00:17:13,204 So if you're not careful you can actually bottleneck your training by just trying to read the data off the disk. 167 00:17:14,321 --> 00:17:23,002 'Cause the GPU is super fast, it can compute forward and backward quite fast, but if you're reading sequentially off a spinning disk, you can actually bottleneck your training quite a bit, 168 00:17:23,002 --> 00:17:25,699 and that can be really bad and slow you down. 169 00:17:25,700 --> 00:17:31,459 So some solutions here are that like you know if your dataset's really small, sometimes you might just read the whole dataset into RAM. 170 00:17:31,459 --> 00:17:36,479 Or even if your dataset isn't so small, but you have a giant server with a ton of RAM, you might do that anyway. 171 00:17:36,479 --> 00:17:42,917 You can also make sure you're using an SSD instead of a hard drive, that can help a lot with read throughput.
172 00:17:42,917 --> 00:17:52,152 Another common strategy is to use multiple threads on the CPU that are pre-fetching data off RAM or off disk, buffering it in memory, in RAM so that 173 00:17:52,152 --> 00:17:57,724 then you can continue feeding that buffered data down to the GPU with good performance. 174 00:17:57,724 --> 00:18:08,804 This is a little bit painful to set up, but again like, these GPUs are so fast that if you're not really careful with trying to feed them data as quickly as possible, just reading the data can sometimes bottleneck the whole training process. 175 00:18:08,804 --> 00:18:11,657 So that's something to be aware of. 176 00:18:11,657 --> 00:18:17,432 So that's kind of the brief introduction to like sort of GPU and CPU hardware in practice when it comes to deep learning. 177 00:18:17,432 --> 00:18:21,616 And then I wanted to switch gears a little bit and talk about the software side of things. 178 00:18:21,616 --> 00:18:25,006 The various deep learning frameworks that people are using in practice. 179 00:18:25,006 --> 00:18:28,819 But I guess before I move on, are there any sort of questions about CPUs and GPUs? 180 00:18:28,819 --> 00:18:30,519 Yeah, question? 181 00:18:30,519 --> 00:18:34,686 [student's words obscured due to lack of microphone] 182 00:18:40,961 --> 00:18:45,854 Yeah, so the question is what can you sort of, what can you do mechanically when you're coding to avoid these problems? 183 00:18:45,854 --> 00:18:50,833 Probably the biggest thing you can do in software is set up sort of pre-fetching on the CPU. 184 00:18:50,833 --> 00:18:55,054 Like, sort of a naive thing would be you have this sequential process where you 185 00:18:55,054 --> 00:18:58,791 first read data off disk, wait for the data, wait for the minibatch to be read, 186 00:18:58,791 --> 00:19:02,458 then feed the minibatch to the GPU, then go forward and backward on the GPU, 187 00:19:02,458 --> 00:19:05,442 then read another minibatch and sort of do this all in sequence. 188 00:19:06,714 --> 00:19:15,469 Instead you might have multiple CPU threads running in the background that are fetching data off the disk, such that 189 00:19:15,469 --> 00:19:17,076 you can sort of interleave all of these things. 190 00:19:17,076 --> 00:19:21,506 Like the GPU is computing, the CPU background threads are feeding data off disk 191 00:19:21,506 --> 00:19:28,534 and your main thread is just doing a bit of synchronization between these things so they're all happening in parallel. 192 00:19:28,534 --> 00:19:38,016 And thankfully if you're using some of these deep learning frameworks that we're about to talk about, then some of this work has already been done for you 'cause it's a little bit painful. 193 00:19:38,016 --> 00:19:41,738 So the landscape of deep learning frameworks is super fast moving. 194 00:19:41,738 --> 00:19:47,915 So last year when I gave this lecture I talked mostly about Caffe, Torch, Theano and TensorFlow. 195 00:19:47,915 --> 00:20:00,232 And when I last gave this talk, again more than a year ago, TensorFlow was relatively new. It had not seen super widespread adoption yet at that time. But now I think in the last year TensorFlow 196 00:20:00,232 --> 00:20:06,310 has gotten much more popular. It's probably the main framework of choice for many people. So that's a big change.
197 00:20:07,342 --> 00:20:12,282 We've also seen a ton of new frameworks sort of popping up like mushrooms in the last year. 198 00:20:12,282 --> 00:20:18,052 So in particular Caffe2 and PyTorch are new frameworks from Facebook that I think are pretty interesting. 199 00:20:18,052 --> 00:20:20,409 There's also a ton of other frameworks. 200 00:20:20,409 --> 00:20:24,089 Baidu has Paddle, Microsoft has CNTK, 201 00:20:24,089 --> 00:20:33,449 Amazon is mostly using MXNet and there's a ton of other frameworks as well that I'm less familiar with and really don't have time to get into. 202 00:20:33,449 --> 00:20:43,572 But one interesting thing to point out from this picture is that kind of the first generation of deep learning frameworks that really saw wide adoption were built in academia. 203 00:20:43,572 --> 00:20:49,388 So Caffe was from Berkeley, Torch was developed originally at NYU and also in collaboration with Facebook. 204 00:20:49,388 --> 00:20:52,077 And Theano was mostly built at the University of Montreal. 205 00:20:52,077 --> 00:20:56,491 But these kind of next generation deep learning frameworks all originated in industry. 206 00:20:56,491 --> 00:21:00,659 So Caffe2 is from Facebook, PyTorch is from Facebook. TensorFlow is from Google. 207 00:21:00,659 --> 00:21:08,925 So one kind of interesting shift that we've seen in the landscape over the last couple of years is that these ideas have really moved a lot from academia into industry. 208 00:21:08,925 --> 00:21:13,187 And now industry is kind of giving us these big powerful nice frameworks to work with. 209 00:21:14,147 --> 00:21:24,850 So today I wanted to mostly talk about PyTorch and TensorFlow 'cause I personally think that those are probably the ones you should be focusing on for a lot of research type problems these days. 210 00:21:24,850 --> 00:21:32,192 I'll also talk a bit about Caffe and Caffe2. But probably a little bit less emphasis on those. 211 00:21:32,192 --> 00:21:36,705 And before we move any farther, I thought I should make my own biases a little bit more explicit. 212 00:21:36,705 --> 00:21:43,501 So I've worked with Torch mostly for the last several years. And I've used it quite a lot, I like it a lot. 213 00:21:43,501 --> 00:21:48,568 And then in the last year I've mostly switched to PyTorch as my main research framework. 214 00:21:48,568 --> 00:21:52,306 So I have a little bit less experience with some of these others, especially TensorFlow, 215 00:21:52,306 --> 00:21:58,382 but I'll still try to do my best to give you a fair picture and a decent overview of these things. 216 00:21:58,382 --> 00:22:06,807 So, remember that in the last several lectures we've hammered in this idea of computational graphs sort of over and over. 217 00:22:06,807 --> 00:22:13,176 That whenever you're doing deep learning, you want to think about building some computational graph that computes whatever function that you want to compute. 218 00:22:13,176 --> 00:22:18,778 So in the case of a linear classifier you'll combine your data X and your weights W with a matrix multiply. 219 00:22:18,778 --> 00:22:22,832 You'll do some kind of hinge loss to maybe compute your loss. 220 00:22:22,832 --> 00:22:28,909 You'll have some regularization term and you imagine stitching together all these different operations into some graph structure.
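(As a tiny, hedged sketch of that linear-classifier graph written out in Numpy, with made-up shapes; each line roughly corresponds to one node in the graph.)

```python
import numpy as np

# Sketch of the linear-classifier computational graph (illustrative shapes).
N, D, C, reg = 64, 3072, 10, 1e-3
X = np.random.randn(N, D)                       # data
y = np.random.randint(C, size=N)                # labels
W = 0.01 * np.random.randn(D, C)                # weights

scores = X.dot(W)                               # node: matrix multiply
correct = scores[np.arange(N), y][:, None]
margins = np.maximum(0, scores - correct + 1.0) # node: multiclass hinge loss
margins[np.arange(N), y] = 0
data_loss = margins.sum() / N
loss = data_loss + reg * np.sum(W * W)          # node: add regularization term
```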
221 00:22:28,909 --> 00:22:36,167 Remember that these graph structures can get pretty complex in the case of a big neural net, now there's many different layers, many different activations. 222 00:22:36,167 --> 00:22:39,687 Many different weights spread all around in a pretty complex graph. 223 00:22:39,687 --> 00:22:47,328 And as you move to things like Neural Turing Machines then you can get these really crazy computational graphs that you can't even really draw because they're so big and messy. 224 00:22:48,349 --> 00:22:58,727 So the point of deep learning frameworks is really, there's really kind of three main reasons why you might want to use one of these deep learning frameworks rather than just writing your own code. 225 00:22:58,727 --> 00:23:08,610 So the first would be that these frameworks enable you to easily build and work with these big hairy computational graphs without kind of worrying about a lot of those bookkeeping details yourself. 226 00:23:08,610 --> 00:23:13,956 Another major idea is that, whenever we're working in deep learning we always need to compute gradients. 227 00:23:14,812 --> 00:23:18,900 We're always computing some loss, we're always computing the gradient of the loss with respect to our weights. 228 00:23:18,900 --> 00:23:26,115 And we'd like the framework to compute those gradients automatically; you don't want to have to write that code yourself. 229 00:23:26,115 --> 00:23:36,539 You want that framework to handle all these back propagation details for you so you can just think about writing down the forward pass of your network and have the backward pass sort of come out for free without any additional work. 230 00:23:36,539 --> 00:23:42,000 And finally you want all this stuff to run efficiently on GPUs so you don't have to worry too much about these 231 00:23:42,000 --> 00:23:48,389 low level hardware details about cuBLAS and cuDNN and CUDA and moving data between the CPU and GPU memory. 232 00:23:48,389 --> 00:23:52,439 You kind of want all those messy details to be taken care of for you. 233 00:23:52,439 --> 00:23:59,450 So those are kind of some of the major reasons why you might choose to use frameworks rather than writing your own stuff from scratch. 234 00:23:59,450 --> 00:24:05,231 So as kind of a concrete example of a computational graph we can maybe write down this super simple thing. 235 00:24:05,231 --> 00:24:13,071 Where we have three inputs, X, Y, and Z. We're going to combine X and Y to produce A. Then we're going to combine A and Z to produce B 236 00:24:13,071 --> 00:24:18,630 and then finally we're going to do some maybe summing out operation on B to give some scalar final result C. 237 00:24:18,630 --> 00:24:31,631 So you've probably written enough Numpy code at this point to realize that it's super easy to write down, to implement this computational graph, or rather to implement this bit of computation in Numpy, right? 238 00:24:31,631 --> 00:24:41,923 You can just kind of write down in Numpy that you want to generate some random data, you want to multiply two things, you want to add two things, you want to sum out a couple things. And it's really easy to do this in Numpy. 239 00:24:41,923 --> 00:24:48,355 But then the question is like suppose that we want to compute the gradient of C with respect to X, Y, and Z. 240 00:24:48,355 --> 00:24:52,725 So, if you're working in Numpy, you kind of need to write out this backward pass yourself.
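(A minimal sketch of what that looks like in Numpy, assuming the graph just described is a = x * y, b = a + z, c = sum(b); the forward pass is easy, but the backward pass is on you.)

```python
import numpy as np

np.random.seed(0)
x, y, z = np.random.randn(3, 4), np.random.randn(3, 4), np.random.randn(3, 4)

# Forward pass: easy to write in Numpy.
a = x * y
b = a + z
c = np.sum(b)

# Backward pass: you have to derive and write this yourself.
grad_c = 1.0
grad_b = grad_c * np.ones_like(b)   # d(sum)/db is 1 everywhere
grad_a = grad_b.copy()              # b = a + z
grad_z = grad_b.copy()
grad_x = grad_a * y                 # a = x * y
grad_y = grad_a * x
```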
241 00:24:52,725 --> 00:25:02,859 And you've gotten a lot of practice with this on the homeworks, but it can be kind of a pain and a little bit annoying and messy once you get to really big complicated things. 242 00:25:02,859 --> 00:25:05,675 The other problem with Numpy is that it doesn't run on the GPU. 243 00:25:05,675 --> 00:25:14,920 So Numpy is definitely CPU only. And you're never going to be able to experience or take advantage of these GPU accelerated speedups if you're stuck working in Numpy. 244 00:25:14,920 --> 00:25:19,527 And it's, again, it's a pain to have to compute your own gradients in all these situations. 245 00:25:19,527 --> 00:25:29,047 So, kind of the goal of most deep learning frameworks these days is to let you write code in the forward pass that looks very similar to Numpy, 246 00:25:29,047 --> 00:25:33,069 but lets you run it on the GPU and lets you automatically compute gradients. 247 00:25:33,069 --> 00:25:36,397 And that's kind of the big picture goal of most of these frameworks. 248 00:25:36,397 --> 00:25:44,314 So if you imagine looking at, if we look at an example in TensorFlow of the exact same computational graph, we now see that in this forward pass, 249 00:25:44,314 --> 00:25:52,687 you write this code that ends up looking very very similar to the Numpy forward pass where you're kind of doing these multiplication and these addition operations. 250 00:25:52,687 --> 00:25:57,623 But now TensorFlow has this magic line that just computes all the gradients for you. 251 00:25:57,623 --> 00:26:02,235 So now you don't have to go in and write your own backward pass and that's much more convenient. 252 00:26:02,235 --> 00:26:08,926 The other nice thing about TensorFlow is you can really just, like with one line you can switch all this computation between CPU and GPU. 253 00:26:08,926 --> 00:26:16,668 So here, if you just add this with statement before you're doing this forward pass, you just can explicitly tell the framework, hey I want to run this code on the CPU. 254 00:26:16,668 --> 00:26:24,866 But now if we just change that with statement a little bit, with just a one character change in this case, changing that C to a G, now the code runs on GPU. 255 00:26:24,866 --> 00:26:31,388 And now in this little code snippet, we've solved these two problems. We're running our code on the GPU 256 00:26:31,388 --> 00:26:35,685 and we're having the framework compute all the gradients for us, so that's really nice. 257 00:26:35,685 --> 00:26:38,459 And PyTorch kind of looks almost exactly the same. 258 00:26:38,459 --> 00:26:42,509 So again, in PyTorch you kind of write down, you define some variables, 259 00:26:42,509 --> 00:26:49,262 you have some forward pass and the forward pass again looks very similar to like, in this case identical to the Numpy code. 260 00:26:49,262 --> 00:26:56,251 And then again, you can just use PyTorch to compute gradients, all your gradients with just one line. 261 00:26:56,251 --> 00:27:06,781 And now in PyTorch again, it's really easy to switch to GPU, you just need to cast all your stuff to the CUDA data type before you run your computation and now everything runs transparently on the GPU for you.
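(A rough PyTorch version of the same little graph, sketched against the older Variable-style API this lecture is describing; the CUDA cast mentioned at the end is only needed if you actually have a GPU.)

```python
import torch
from torch.autograd import Variable

N, D = 3, 4
# requires_grad=True tells PyTorch to build the graph for these inputs
x = Variable(torch.randn(N, D), requires_grad=True)
y = Variable(torch.randn(N, D), requires_grad=True)
z = Variable(torch.randn(N, D), requires_grad=True)

# Forward pass looks just like the Numpy version.
a = x * y
b = a + z
c = torch.sum(b)

c.backward()                        # one line to compute all the gradients
print(x.grad, y.grad, z.grad)

# To run on GPU instead, you would cast the data to the CUDA type first,
# e.g. x = Variable(torch.randn(N, D).cuda(), requires_grad=True), and so on.
```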
262 00:27:06,781 --> 00:27:13,878 So if you kind of just look at these three examples, these three snippets of code side by side, the Numpy, the TensorFlow and the PyTorch 263 00:27:13,878 --> 00:27:20,564 you see that the TensorFlow and the PyTorch code in the forward pass looks almost exactly like Numpy 264 00:27:20,564 --> 00:27:24,349 which is great 'cause Numpy has a beautiful API, it's really easy to work with. 265 00:27:24,349 --> 00:27:29,192 But we can compute gradients automatically and we can run on the GPU automatically. 266 00:27:30,186 --> 00:27:37,502 So after that kind of introduction, I wanted to dive in and talk in a little bit more detail about kind of what's going on inside this TensorFlow example. 267 00:27:37,502 --> 00:27:50,662 So as a running example throughout the rest of the lecture, I'm going to use training a two-layer fully connected ReLU network on random data. 268 00:27:50,662 --> 00:27:55,289 And we're going to train this thing with an L2 Euclidean loss on random data. 269 00:27:55,289 --> 00:28:08,966 So this is kind of a silly network, it's not really doing anything useful, but it is relatively small, self contained, the code fits on the slide without being too small, and it lets you demonstrate kind of a lot of the useful ideas inside these frameworks. 270 00:28:08,966 --> 00:28:15,900 So here on the right, oh, and then another note, I'm kind of assuming that Numpy and TensorFlow have already been imported in all these code snippets. 271 00:28:15,900 --> 00:28:21,163 So in TensorFlow you would typically divide your computation into two major stages. 272 00:28:21,163 --> 00:28:28,363 First, we're going to write some code that defines our computational graph, and that's this red code up in the top half. 273 00:28:28,363 --> 00:28:32,360 And then after you define your graph, you're going to run the graph over and over again 274 00:28:32,360 --> 00:28:36,851 and actually feed data into the graph to perform whatever computation you want it to perform. 275 00:28:36,851 --> 00:28:40,961 So this is the really, this is kind of the big common pattern in TensorFlow. 276 00:28:40,961 --> 00:28:46,615 You'll first have a bunch of code that builds the graph and then you'll go and run the graph and reuse it many many times. 277 00:28:48,099 --> 00:28:52,763 So if you kind of dive into the code of building the graph in this case. 278 00:28:52,763 --> 00:29:00,709 Up at the top you see that we're defining this X, Y, w1 and w2, and we're creating these tf.placeholder objects. 279 00:29:01,637 --> 00:29:05,193 So these are going to be input nodes to the graph. 280 00:29:05,193 --> 00:29:15,379 These are going to be sort of entry points to the graph where when we run the graph, we're going to feed in data and put them in through these input slots in our computational graph. 281 00:29:15,379 --> 00:29:21,944 So this is not actually like allocating any memory right now. We're just sort of setting up these input slots to the graph. 282 00:29:23,272 --> 00:29:28,665 Then we're going to use those input slots which are now kind of like these symbolic variables 283 00:29:28,665 --> 00:29:37,135 and we're going to perform different TensorFlow operations on these symbolic variables in order to set up what computation we want to run on those variables.
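(Roughly, the graph-definition half being described might look like the sketch below, assuming the TensorFlow 1.x API; N, D, H are illustrative sizes, not the exact values on the slide.)

```python
import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100

# Define the graph: placeholders are input slots, nothing is computed yet.
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))
w1 = tf.placeholder(tf.float32, shape=(D, H))
w2 = tf.placeholder(tf.float32, shape=(H, D))

h = tf.maximum(tf.matmul(x, w1), 0)                       # hidden layer with ReLU
y_pred = tf.matmul(h, w2)                                 # output predictions
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff ** 2, axis=1))   # L2 / Euclidean loss

# Ask TensorFlow to add gradient nodes to the graph.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])
```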
284 00:29:37,135 --> 00:29:46,109 So in this case we're doing a matrix multiplication between X and w1, we're doing some tf.maximum to do a ReLU nonlinearity and then we're doing another 285 00:29:46,109 --> 00:29:49,240 matrix multiplication to compute our output predictions. 286 00:29:49,240 --> 00:29:58,175 And then we're again using sort of basic tensor operations to compute our Euclidean distance, our L2 loss between our prediction and the target Y. 287 00:29:58,175 --> 00:30:05,824 Another thing to point out here is that these lines of code are not actually computing anything. There's no data in the system right now. 288 00:30:05,824 --> 00:30:15,001 We're just building up this computational graph data structure telling TensorFlow which operations we want to eventually run once we put in real data. 289 00:30:15,001 --> 00:30:18,648 So this is just building the graph, this is not actually doing anything. 290 00:30:18,648 --> 00:30:33,135 Then we have this magical line where after we've computed our loss with these symbolic operations, then we can just ask TensorFlow to compute the gradient of the loss with respect to w1 and w2 in this one magical, beautiful line. 291 00:30:33,135 --> 00:30:37,981 And this avoids you having to write all your own backprop code like you had to do in the assignments. 292 00:30:37,981 --> 00:30:40,439 But again there's no actual computation happening here. 293 00:30:40,439 --> 00:30:51,108 This is just sort of adding extra operations to the computational graph where now the computational graph has these additional operations which will end up computing these gradients for you. 294 00:30:51,108 --> 00:31:01,421 So now at this point we've built our computational graph, we have this big graph in this graph data structure in memory that knows what operations we want to perform to compute the loss and gradients. 295 00:31:01,421 --> 00:31:06,843 And now we enter a TensorFlow session to actually run this graph and feed it with data. 296 00:31:06,843 --> 00:31:13,859 So then, once we've entered the session, then we actually need to construct some concrete values that will be fed to the graph. 297 00:31:13,859 --> 00:31:19,459 So TensorFlow just expects to receive data from Numpy arrays in most cases. 298 00:31:19,459 --> 00:31:30,226 So here we're just creating concrete actual values for X, Y, w1 and w2 using Numpy and then storing these in some dictionary. 299 00:31:30,226 --> 00:31:32,743 And now here is where we're actually running the graph. 300 00:31:32,743 --> 00:31:38,120 So you can see that we're calling a session.run to actually execute some part of the graph. 301 00:31:38,120 --> 00:31:43,899 The first argument tells it which parts of the graph we actually want as output. 302 00:31:43,899 --> 00:31:50,950 So in this case we need to tell it that we actually want to compute loss and grad w1 and grad w2 303 00:31:50,950 --> 00:31:57,140 and we need to pass in with this feed dict parameter the actual concrete values that will be fed to the graph. 304 00:31:57,140 --> 00:32:06,541 And then after, in this one line, it's going and running the graph and then computing those values for loss, grad w1, and grad w2 305 00:32:06,541 --> 00:32:12,003 and then returning the actual concrete values for those in Numpy arrays again. 306 00:32:12,003 --> 00:32:19,859 So now after you unpack this output in the second line, you get Numpy arrays with the loss and the gradients.
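(Continuing the sketch from above, assuming the same placeholder definitions, the run-the-graph half might look roughly like this.)

```python
# Continuing the sketch: now run the graph with concrete Numpy data.
with tf.Session() as sess:
    values = {
        x: np.random.randn(N, D),
        y: np.random.randn(N, D),
        w1: np.random.randn(D, H),
        w2: np.random.randn(H, D),
    }
    # Ask for the loss and both gradients; feed_dict supplies the placeholders.
    out = sess.run([loss, grad_w1, grad_w2], feed_dict=values)
    loss_val, grad_w1_val, grad_w2_val = out   # plain Numpy arrays come back
```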
307 00:32:19,859 --> 00:32:23,697 So then you can go and do whatever you want with these values. 308 00:32:23,697 --> 00:32:29,599 So then, this has only run sort of one forward and backward pass through our graph, 309 00:32:29,599 --> 00:32:33,167 and it only takes a couple extra lines if we actually want to train the network. 310 00:32:33,167 --> 00:32:45,511 So here, now we're running the graph many times in a loop, so we're doing a for loop and in each iteration of the loop, we're calling session.run asking it to compute the loss and the gradients. 311 00:32:45,511 --> 00:32:52,291 And now we're doing a manual gradient descent step using those computed gradients to now update our current values of the weights. 312 00:32:52,291 --> 00:33:00,749 So if you actually run this code and plot the losses, then you'll see that the loss goes down and the network is training and this is working pretty well. 313 00:33:00,749 --> 00:33:06,113 So this is kind of like a super bare bones example of training a fully connected network in TensorFlow. 314 00:33:06,113 --> 00:33:08,046 But there's a problem here. 315 00:33:08,046 --> 00:33:15,086 So here, remember that on the forward pass, every time we execute this graph, we're actually feeding in the weights. 316 00:33:15,086 --> 00:33:18,835 We have the weights as Numpy arrays and we're explicitly feeding them into the graph. 317 00:33:18,835 --> 00:33:26,339 And now when the graph finishes executing it's going to give us these gradients. And remember the gradients are the same size as the weights. 318 00:33:26,339 --> 00:33:32,665 So this means that every time we're running the graph here, we're copying the weights from Numpy arrays into TensorFlow then getting the gradients 319 00:33:32,665 --> 00:33:36,419 and then copying the gradients from TensorFlow back out to Numpy arrays. 320 00:33:36,419 --> 00:33:39,849 So if you're just running on CPU, this is maybe not a huge deal, 321 00:33:39,849 --> 00:33:47,235 but remember we talked about the CPU GPU bottleneck and how it's very expensive actually to copy data between CPU memory and GPU memory. 322 00:33:47,235 --> 00:33:59,256 So if your network is very large and your weights and gradients are very big, then doing something like this would be super expensive and super slow because we'd be copying all kinds of data back and forth between the CPU and the GPU at every time step. 323 00:33:59,256 --> 00:34:01,689 So that's bad, we don't want to do that. We need to fix that. 324 00:34:01,689 --> 00:34:06,027 So, obviously TensorFlow has some solution to this. 325 00:34:06,027 --> 00:34:17,969 And the idea is that now we want our weights, w1 and w2, rather than being placeholders that we expect to feed in to the network on every forward pass, instead we define them as variables. 326 00:34:17,969 --> 00:34:27,346 So a variable is a value that lives inside the computational graph and it's going to persist inside the computational graph across different times when you run the same graph. 327 00:34:27,347 --> 00:34:33,094 So now instead of declaring these w1 and w2 as placeholders, instead we just construct them as variables. 328 00:34:33,094 --> 00:34:39,219 But now since they live inside the graph, we also need to tell TensorFlow how they should be initialized, right?
329 00:34:39,219 --> 00:34:44,606 Because in the previous case we were feeding in their values from outside the graph, so we initialized them in Numpy, 330 00:34:44,606 --> 00:34:50,569 but now because these things live inside the graph, TensorFlow is responsible for initializing them. 331 00:34:50,569 --> 00:34:53,149 So we need to pass in a tf.random_normal operation, 332 00:34:53,149 --> 00:35:00,627 which again is not actually initializing them when we run this line, this is just telling TensorFlow how we want them to be initialized. 333 00:35:00,627 --> 00:35:03,215 So there's a little bit of confusing misdirection going on here. 334 00:35:04,869 --> 00:35:11,862 And now, remember in the previous example we were actually updating the weights outside of the computational graph. 335 00:35:11,862 --> 00:35:17,219 In the previous example, we were computing the gradients and then using them to update the weights as Numpy arrays 336 00:35:17,219 --> 00:35:20,264 and then feeding in the updated weights at the next time step. 337 00:35:20,264 --> 00:35:29,402 But now because we want these weights to live inside the graph, this operation of updating the weights needs to also be an operation inside the computational graph. 338 00:35:29,402 --> 00:35:37,020 So now we use this assign function which mutates these variables inside the computational graph 339 00:35:37,020 --> 00:35:41,487 and now the mutated value will persist across multiple runs of the same graph. 340 00:35:41,487 --> 00:35:45,976 So now when we run this graph and when we train the network, 341 00:35:45,976 --> 00:35:53,825 now we need to run the graph once with a little bit of special incantation to tell TensorFlow to set up these variables that are going to live inside the graph. 342 00:35:53,825 --> 00:35:58,574 And then once we've done that initialization, now we can run the graph over and over again. 343 00:35:58,574 --> 00:36:05,091 And here, we're now only feeding in the data and labels X and Y and the weights are living inside the graph. 344 00:36:05,091 --> 00:36:09,517 And here we've asked TensorFlow to compute the loss for us. 345 00:36:09,517 --> 00:36:13,001 And then you might think that this would train the network, 346 00:36:13,001 --> 00:36:19,964 but there's actually a bug here. So, if you actually run this code, and you plot the loss, it doesn't train. 347 00:36:19,964 --> 00:36:23,401 So that's bad, it's confusing, like what's going on? 348 00:36:23,401 --> 00:36:29,957 We wrote this assign code, we ran the thing, like we computed the loss and the gradients and our loss is flat, what's going on? 349 00:36:29,957 --> 00:36:31,460 Any ideas? 350 00:36:31,460 --> 00:36:34,595 [student's words obscured due to lack of microphone] 351 00:36:34,595 --> 00:36:44,979 Yeah so one hypothesis is that maybe we're accidentally re-initializing the w's every time we call the graph. That's a good hypothesis, that's actually not the problem in this case. 352 00:36:44,979 --> 00:36:48,057 [student's words obscured due to lack of microphone] 353 00:36:48,057 --> 00:36:56,318 Yeah, so the answer is that we actually need to explicitly tell TensorFlow that we want to run these new w1 and new w2 operations. 354 00:36:56,318 --> 00:36:58,835 So we've built up this big computational graph data 355 00:36:58,835 --> 00:37:01,699 structure in memory and now when we call run, 356 00:37:01,699 --> 00:37:04,894 we only told TensorFlow that we wanted to compute loss.
357 00:37:04,894 --> 00:37:09,155 And if you look at the dependencies among these different operations inside the graph, 358 00:37:09,155 --> 00:37:13,715 you see that in order to compute loss we don't actually need to perform this update operation. 359 00:37:13,715 --> 00:37:21,496 So TensorFlow is smart and it only computes the parts of the graph that are necessary for computing the output that you asked it to compute. 360 00:37:21,496 --> 00:37:26,656 So that's kind of a nice thing because it means it's only doing as much work as it needs to, 361 00:37:26,656 --> 00:37:32,739 but in situations like this it can be a little bit confusing and lead to behavior that you didn't expect. 362 00:37:32,739 --> 00:37:39,141 So the solution in this case is that we actually need to explicitly tell TensorFlow to perform those update operations. 363 00:37:39,141 --> 00:37:49,531 So one thing we could do, which is what was suggested, is we could add new w1 and new w2 as outputs and just tell TensorFlow that we want to produce these values as outputs. 364 00:37:49,531 --> 00:37:57,366 But that's a problem too because the values, those new w1, new w2 values are again these big tensors. 365 00:37:58,891 --> 00:38:05,138 So now if we tell TensorFlow we want those as output, we're going to again get this copying behavior between CPU and GPU at every iteration. 366 00:38:05,138 --> 00:38:07,316 So that's bad, we don't want that. 367 00:38:07,316 --> 00:38:11,742 So there's a little trick you can do instead, which is that we add kind of a dummy node to the graph 368 00:38:11,742 --> 00:38:20,307 with these fake data dependencies, and we just say that this dummy node, called updates, has data dependencies on new w1 and new w2. 369 00:38:20,307 --> 00:38:25,803 And now when we actually run the graph, we tell it to compute both the loss and this dummy node. 370 00:38:25,803 --> 00:38:38,468 And this dummy node doesn't actually return any value, it just returns none, but because of this dependency that we've put into the node it ensures that when we run the updates value, we actually also run these update operations. 371 00:38:38,468 --> 00:38:39,551 So, question? 372 00:38:40,788 --> 00:38:44,955 [student's words obscured due to lack of microphone] 373 00:38:45,854 --> 00:38:51,370 Is there a reason why we didn't put X and Y into the graph, and they stayed as Numpy? 374 00:38:51,370 --> 00:38:57,151 So in this example we're reusing the same X and Y on every iteration. 375 00:38:57,151 --> 00:39:10,122 So you're right, we could have just also stuck those in the graph, but in a more realistic scenario, X and Y will be minibatches of data so those will actually change at every iteration and we will want to feed different values for those at every iteration. 376 00:39:10,122 --> 00:39:14,330 So in this case, they could have stayed in the graph, but in most cases they will change, 377 00:39:14,330 --> 00:39:17,913 so we don't want them to live in the graph. 378 00:39:19,388 --> 00:39:21,290 Oh, another question? 379 00:39:21,290 --> 00:39:25,457 [student's words obscured due to lack of microphone] 380 00:39:37,046 --> 00:39:44,305 Yeah, so we've told it, we had put into TensorFlow that the outputs we want are loss and updates. 381 00:39:44,305 --> 00:39:51,801 Updates is not actually a real value. So when updates evaluates it just returns none. 382 00:39:51,801 --> 00:39:57,416 But because of this dependency we've told it that updates depends on these assign operations.
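(Sketching the pattern under discussion, again with the TF 1.x API and the placeholder/loss definitions from the earlier sketch: the weights become tf.Variables, the updates become assign ops, and a tf.group node ties them together so fetching one thing triggers both updates.)

```python
# Sketch of the variable + assign + group pattern (TF 1.x API).
w1 = tf.Variable(tf.random_normal((D, H)))   # weights now live in the graph
w2 = tf.Variable(tf.random_normal((H, D)))

# ... forward pass, loss, and tf.gradients as in the earlier sketch ...

learning_rate = 1e-5
new_w1 = w1.assign(w1 - learning_rate * grad_w1)   # update ops inside the graph
new_w2 = w2.assign(w2 - learning_rate * grad_w2)
updates = tf.group(new_w1, new_w2)                 # dummy node depending on both

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())    # initialize the variables once
    values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
    for t in range(50):
        # Fetching `updates` forces the assign ops to run; it evaluates to None.
        loss_val, _ = sess.run([loss, updates], feed_dict=values)
```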
383 00:39:57,416 --> 00:40:02,358 But these assign operations live inside the computational graph and all live inside GPU memory. 384 00:40:02,358 --> 00:40:10,190 So then we're doing these update operations entirely on the GPU and we're no longer copying the updated values back out of the graph. 385 00:40:11,723 --> 00:40:15,112 [student's words obscured due to lack of microphone] 386 00:40:15,112 --> 00:40:18,195 So the question is does tf.group return none? 387 00:40:18,195 --> 00:40:25,923 So this gets into the trickiness of TensorFlow. So tf.group returns some crazy TensorFlow value. 388 00:40:25,923 --> 00:40:32,658 It sort of returns some like internal TensorFlow node operation that we can use to continue building the graph. 389 00:40:32,658 --> 00:40:43,333 But when you execute the graph, when inside the session.run we told it we wanted to compute the concrete value from updates, then that returns none. 390 00:40:43,333 --> 00:40:45,482 So whenever you're working with TensorFlow 391 00:40:45,482 --> 00:40:53,487 you have this funny indirection between building the graph and running it; while building the graph the output value is some funny weird object, and then you actually get 392 00:40:53,487 --> 00:40:55,466 a concrete value when you run the graph. 393 00:40:55,466 --> 00:40:59,967 So here after you run updates, then the output is none. Does that clear it up a little bit? 394 00:40:59,967 --> 00:41:04,134 [student's words obscured due to lack of microphone] 395 00:41:18,796 --> 00:41:22,334 So the question is why is loss a value and why is updates none? 396 00:41:22,334 --> 00:41:24,068 That's just the way that updates works. 397 00:41:24,068 --> 00:41:30,176 So loss is a value because when we tell TensorFlow we want to run a tensor, then we get the concrete value. 398 00:41:30,176 --> 00:41:35,753 Updates is this kind of special other data type that does not return a value, it instead returns none. 399 00:41:35,753 --> 00:41:38,703 So it's kind of some TensorFlow magic that's going on there. 400 00:41:38,703 --> 00:41:40,602 Maybe we can talk offline if you're still confused. 401 00:41:40,602 --> 00:41:42,678 [student's words obscured due to lack of microphone] 402 00:41:42,678 --> 00:41:46,186 Yeah, yeah, that behavior is coming from the group method. 403 00:41:46,186 --> 00:41:52,492 So now, we kind of have this weird pattern where we wanted to do these different assign operations, we have to use this funny tf.group thing. 404 00:41:52,492 --> 00:42:00,004 That's kind of a pain, so thankfully TensorFlow gives you some convenience operations that kind of do that kind of stuff for you. 405 00:42:00,004 --> 00:42:01,706 And that's called an optimizer. 406 00:42:01,706 --> 00:42:06,047 So here we're using a tf.train.GradientDescentOptimizer 407 00:42:06,047 --> 00:42:08,458 and we're telling it what learning rate we want to use. 408 00:42:08,458 --> 00:42:12,784 And you can imagine that there's, there's RMSprop, there's all kinds of different optimization algorithms here.
409 00:42:12,784 --> 00:42:16,284 And now we call optimizer.minimize of loss 410 00:42:17,311 --> 00:42:21,204 and now this is a pretty magical thing, 411 00:42:21,204 --> 00:42:30,586 because now this call is aware that these variables w1 and w2 are marked as trainable by default, so then internally, inside this optimizer.minimize 412 00:42:30,586 --> 00:42:35,184 it's going in and adding nodes to the graph which will compute the gradient of loss with respect 413 00:42:35,184 --> 00:42:42,219 to w1 and w2 and then it's also performing that update operation for you and it's doing the grouping operation for you and it's doing the assigns. 414 00:42:42,219 --> 00:42:44,206 It's doing a lot of magical stuff inside there. 415 00:42:44,206 --> 00:42:53,518 But then it ends up giving you this magical updates value which, if you dig through the code, is actually using tf.group, so it looks very similar internally to what we saw before. 416 00:42:53,518 --> 00:43:00,004 And now when we run the graph inside our loop we do the same pattern of telling it to compute loss and updates. 417 00:43:00,004 --> 00:43:07,450 And every time we tell the graph to compute updates, then it'll actually go and update the variables in the graph. 418 00:43:07,450 --> 00:43:08,593 Question? 419 00:43:08,593 --> 00:43:10,959 [student's words obscured due to lack of microphone] 420 00:43:10,959 --> 00:43:14,249 Yeah, so what is the tf.global_variables_initializer? 421 00:43:14,249 --> 00:43:20,502 So that's initializing w1 and w2 because these are variables which live inside the graph. 422 00:43:20,502 --> 00:43:37,733 So when we create the tf.Variable we have this tf.random_normal, which is the initialization, so the tf.global_variables_initializer is causing the tf.random_normal to actually run and generate concrete values to initialize those variables. 423 00:43:37,733 --> 00:43:40,794 [student's words obscured due to lack of microphone] 424 00:43:40,794 --> 00:43:42,271 Sorry, what was the question? 425 00:43:42,271 --> 00:43:45,233 [student's words obscured due to lack of microphone] 426 00:43:45,233 --> 00:43:51,385 So it knows that a placeholder is going to be fed outside of the graph and a variable is something that lives inside the graph. 427 00:43:51,385 --> 00:44:00,384 So I don't know all the details about what exactly it decides to run with that call. I think you'd need to dig through the code to figure that out, or maybe it's documented somewhere. 428 00:44:00,384 --> 00:44:06,130 So now we've again got this full example of training a network in TensorFlow and we're kind of adding 429 00:44:06,130 --> 00:44:09,328 bells and whistles to make it a little bit more convenient. 430 00:44:09,328 --> 00:44:16,954 So here, in the previous example we were computing the loss explicitly using our own tensor operations; in TensorFlow you can always do that, 431 00:44:16,954 --> 00:44:20,739 you can use basic tensor operations to compute just about anything you want. 432 00:44:20,739 --> 00:44:26,734 But TensorFlow also gives you a bunch of convenience functions that compute these common neural network things for you. 433 00:44:26,734 --> 00:44:30,040 So in this case we can use tf.losses.mean_squared_error 434 00:44:30,040 --> 00:44:36,273 and it just does the L2 loss for us so we don't have to compute it ourselves in terms of basic tensor operations.
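A sketch of the same training loop once the optimizer and the loss helper take over that bookkeeping might look roughly like this (again TensorFlow 1.x style, with sizes and learning rate chosen arbitrarily):

```python
import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))
w1 = tf.Variable(tf.random_normal((D, H)))
w2 = tf.Variable(tf.random_normal((H, D)))

h = tf.maximum(tf.matmul(x, w1), 0)
y_pred = tf.matmul(h, w2)

# Convenience loss instead of writing the L2 loss by hand.
loss = tf.losses.mean_squared_error(labels=y, predictions=y_pred)

# The optimizer adds the gradient, assign, and group nodes to the graph for us.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-3)
updates = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # runs the random_normal initializers
    values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
    for t in range(50):
        loss_val, _ = sess.run([loss, updates], feed_dict=values)
```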
435 00:44:36,273 --> 00:44:46,667 So another kind of weirdness here is that it was kind of annoying that we had to explicitly define our inputs and define our weights and then chain them together in the forward pass using a matrix multiply. 436 00:44:46,667 --> 00:44:54,291 And in this example we've actually not put biases in the layer because that would be kind of extra work: we'd have to initialize the biases, 437 00:44:54,291 --> 00:44:58,494 we'd have to get them in the right shape, we'd have to broadcast the biases against the output 438 00:44:58,494 --> 00:45:01,966 of the matrix multiply and you can see that that would kind of be a lot of code. 439 00:45:01,966 --> 00:45:03,664 It would be kind of annoying to write. 440 00:45:03,664 --> 00:45:09,653 And once you get to convolutions and batch normalization and other types of layers, this kind of basic way of working, 441 00:45:09,653 --> 00:45:17,403 of having these variables, having these inputs and outputs and combining them all together with basic computational graph operations, could be a little bit 442 00:45:17,403 --> 00:45:22,954 unwieldy and it could be really annoying to make sure you initialize the weights with the right shapes and all that sort of stuff. 443 00:45:22,954 --> 00:45:30,615 So as a result, there's a bunch of sort of higher level libraries that wrap around TensorFlow and handle some of these details for you. 444 00:45:30,615 --> 00:45:35,965 So one example that ships with TensorFlow is tf.layers. 445 00:45:35,965 --> 00:45:44,060 So now in this code example you can see that our code is only explicitly declaring the X and the Y, which are the placeholders for the data and the labels. 446 00:45:44,060 --> 00:45:53,036 And now we say that H=tf.layers.dense, we give it the input X and we tell it units=H. 447 00:45:53,036 --> 00:45:55,171 This is again kind of a magical line 448 00:45:55,171 --> 00:46:07,411 because inside this line, it's kind of setting up w1 and b1, the bias, it's setting up variables for those with the right shapes that are kind of inside the graph but a little bit hidden from us. 449 00:46:07,411 --> 00:46:12,931 And it's using this Xavier initializer object to set up an initialization strategy for those. 450 00:46:12,931 --> 00:46:17,200 So before we were doing that explicitly ourselves with the tf.random_normal business, 451 00:46:17,200 --> 00:46:22,266 but now here it's kind of handling some of those details for us and it's just spitting out an H, 452 00:46:22,266 --> 00:46:27,515 which is again the same sort of H that we saw in the previous example, it's just doing some of those details for us. 453 00:46:28,487 --> 00:46:36,910 And you can see here, we're also passing an activation=tf.nn.relu so it's even doing the activation, the relu activation function, inside this layer for us. 454 00:46:36,910 --> 00:46:41,370 So it's taking care of a lot of these architectural details for us. 455 00:46:41,370 --> 00:46:42,784 Question? 456 00:46:42,784 --> 00:46:46,446 [student's words obscured due to lack of microphone] 457 00:46:46,446 --> 00:46:51,168 Question is does the Xavier initializer default to a particular distribution? 458 00:46:51,168 --> 00:46:55,850 I'm sure it has some default, I'm not sure what it is. I think you'll have to look at the documentation. 459 00:46:55,850 --> 00:46:58,010 But it seems to be a reasonable strategy, I guess.
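As a sketch, the tf.layers version being described might look roughly like this (the initializer choice and the sizes here are assumptions on my part, not necessarily what the slide used):

```python
import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))

# tf.layers.dense sets up the weight and bias variables with the right shapes
# inside the graph for us, and even applies the relu activation here.
init = tf.contrib.layers.xavier_initializer()
h = tf.layers.dense(inputs=x, units=H, activation=tf.nn.relu,
                    kernel_initializer=init)
y_pred = tf.layers.dense(inputs=h, units=D, kernel_initializer=init)

loss = tf.losses.mean_squared_error(labels=y, predictions=y_pred)
updates = tf.train.GradientDescentOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
    for t in range(50):
        loss_val, _ = sess.run([loss, updates], feed_dict=values)
```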
460 00:46:58,010 --> 00:47:04,111 And in fact if you run this code, it converges much faster than the previous one because the initialization is better. 461 00:47:04,111 --> 00:47:11,911 And you can see that we're using two calls to tf.layers and this lets us build our model without doing all these explicit bookkeeping details ourselves. 462 00:47:11,911 --> 00:47:14,273 So this is maybe a little bit more convenient. 463 00:47:14,273 --> 00:47:18,682 But tf.contrib.layers is really not the only game in town. 464 00:47:18,682 --> 00:47:23,349 There's like a lot of different higher level libraries that people build on top of TensorFlow. 465 00:47:23,349 --> 00:47:26,841 And it's kind of due to this basic impedance mismatch 466 00:47:26,841 --> 00:47:30,315 where the computational graph is a relatively low level thing, 467 00:47:30,315 --> 00:47:36,426 but when we're working with neural networks we have this concept of layers and weights, and some layers have weights associated with them, 468 00:47:36,426 --> 00:47:41,866 and we typically think at a slightly higher level of abstraction than this raw computational graph. 469 00:47:41,866 --> 00:47:48,503 So that's where these various packages are trying to help you out, by letting you work at this higher level of abstraction. 470 00:47:48,503 --> 00:47:52,460 So another very popular package that you may have seen before is Keras. 471 00:47:52,460 --> 00:48:02,806 Keras is a very beautiful, nice API that sits on top of TensorFlow and handles sort of building up the computational graph for you in the back end. 472 00:48:02,806 --> 00:48:07,704 By the way, Keras also supports Theano as a back end, so that's also kind of nice. 473 00:48:07,704 --> 00:48:10,958 And in this example you can see we build the model as a sequence of layers. 474 00:48:10,958 --> 00:48:17,910 We build some optimizer object and we call model.compile and this does a lot of magic in the back end to build the graph. 475 00:48:17,910 --> 00:48:22,797 And now we can call model.fit and that does the whole training procedure for us magically. 476 00:48:22,797 --> 00:48:28,523 So I don't know all the details of how this works, but I know Keras is very popular, so you might consider using it if you're working with TensorFlow. 477 00:48:29,797 --> 00:48:31,270 Question? 478 00:48:31,270 --> 00:48:35,437 [student's words obscured due to lack of microphone] 479 00:48:41,717 --> 00:48:45,525 Yeah, so the question is why there's no explicit CPU, GPU stuff going on here. 480 00:48:45,525 --> 00:48:48,409 So I've kind of left that out to keep the code clean. 481 00:48:48,409 --> 00:48:54,607 But you saw in the examples at the beginning that it was pretty easy to flip all these things between CPU and GPU, and there was either some global flag 482 00:48:54,607 --> 00:49:01,635 or some different data type or some with statement, and it's usually relatively simple and just about one line to swap in each case. 483 00:49:01,635 --> 00:49:06,149 But exactly what that line looks like differs a bit depending on the situation. 484 00:49:06,149 --> 00:49:14,186 So there's actually this whole large set of higher level TensorFlow wrappers that you might see out there in the wild. 485 00:49:14,186 --> 00:49:21,276 And it seems that even people within Google can't really agree on which one is the right one to use. 486 00:49:22,230 --> 00:49:26,829 So Keras and TFLearn are third party libraries that are out there on the internet by other people.
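A rough sketch of the Keras pattern described a moment ago might look like this (the layer sizes, optimizer settings, and exact argument names are assumptions, and the API has shifted a bit across Keras versions):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

N, D, H = 64, 1000, 100

# Define the model as a sequence of layers.
model = Sequential()
model.add(Dense(units=H, input_dim=D))
model.add(Activation('relu'))
model.add(Dense(units=D))

# compile() builds the underlying computational graph in the back end.
model.compile(loss='mean_squared_error', optimizer=SGD(lr=1e-3))

# fit() runs the whole training loop for us.
x = np.random.randn(N, D)
y = np.random.randn(N, D)
model.fit(x, y, epochs=10, batch_size=N, verbose=0)
```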
487 00:49:26,829 --> 00:49:32,563 But there's these three different ones, tf.layers, TF-Slim and tf.contrib.learn 488 00:49:32,563 --> 00:49:39,727 that all ship with TensorFlow, that are all kind of doing a slightly different version of this higher level wrapper thing. 489 00:49:39,727 --> 00:49:46,291 There's another framework also from Google, but not shipping with TensorFlow, called Pretty Tensor that does the same sort of thing. 490 00:49:46,291 --> 00:49:48,599 And I guess none of these were good enough for DeepMind, 491 00:49:48,599 --> 00:49:54,530 because they went ahead a couple weeks ago and wrote and released their very own high level TensorFlow wrapper called Sonnet. 492 00:49:54,530 --> 00:50:00,715 So I wouldn't begrudge you if you were kind of confused by all these things. There's a lot of different choices. 493 00:50:00,715 --> 00:50:07,423 They don't always play nicely with each other. But you have a lot of options, so that's good. 494 00:50:07,423 --> 00:50:09,123 TensorFlow has pretrained models. 495 00:50:09,123 --> 00:50:11,112 There's some examples in TF-Slim, and in Keras. 496 00:50:11,112 --> 00:50:15,874 'Cause remember pretrained models are super important when you're training your own things. 497 00:50:15,874 --> 00:50:21,072 There's also this idea of Tensorboard; I don't want to get into details, 498 00:50:21,072 --> 00:50:27,747 but with Tensorboard you can add sort of instrumentation to your code and then plot losses and things as you go through the training process. 499 00:50:27,747 --> 00:50:32,760 TensorFlow also lets you run distributed, where you can break up a computational graph and run it across different machines. 500 00:50:32,760 --> 00:50:37,613 That's super cool, but I think probably not anyone outside of Google is really using that to great success 501 00:50:37,613 --> 00:50:44,193 these days, but if you do want to run distributed stuff, probably TensorFlow is the main game in town for that. 502 00:50:44,193 --> 00:50:51,533 A side note is that a lot of the design of TensorFlow is kind of spiritually inspired by this earlier framework called Theano from Montreal. 503 00:50:51,533 --> 00:50:55,933 I don't want to go through the details here, just if you go through these slides on your own, 504 00:50:55,933 --> 00:50:59,979 you can see that the code for Theano ends up looking very similar to TensorFlow. 505 00:50:59,979 --> 00:51:03,512 Where we define some variables, we do some forward pass, we compute some gradients, 506 00:51:03,512 --> 00:51:08,034 and we compile some function, then we run the function over and over to train the network. 507 00:51:08,034 --> 00:51:10,290 So it kind of looks a lot like TensorFlow. 508 00:51:10,290 --> 00:51:16,671 So we still have a lot to get through, so I'm going to move on to PyTorch and maybe take questions at the end. 509 00:51:16,671 --> 00:51:26,397 So, PyTorch from Facebook is kind of different from TensorFlow in that we have sort of three explicit different layers of abstraction inside PyTorch. 510 00:51:26,397 --> 00:51:30,619 So PyTorch has this tensor object which is just like a Numpy array. 511 00:51:30,619 --> 00:51:36,770 It's just an imperative array, it doesn't know anything about deep learning, but it can run on the GPU. 512 00:51:36,770 --> 00:51:44,093 We have this variable object which is a node in a computational graph; this builds up computational graphs, lets you compute gradients, that sort of thing.
513 00:51:44,093 --> 00:51:50,766 And we have a module object which is a neural network layer; you can compose these modules together to build big networks. 514 00:51:50,766 --> 00:52:01,457 So if you kind of want to think about rough equivalents between PyTorch and TensorFlow, you can think of the PyTorch tensor as fulfilling the same role as the Numpy array in TensorFlow. 515 00:52:01,457 --> 00:52:08,803 The PyTorch variable is similar to the TensorFlow tensor or variable or placeholder, which are all sort of nodes in a computational graph. 516 00:52:08,803 --> 00:52:18,448 And now the PyTorch module is kind of equivalent to these higher level things from tf.slim or tf.layers or Sonnet or these other higher level frameworks. 517 00:52:18,448 --> 00:52:24,072 So right away one thing to notice about PyTorch is that because it ships with this high level abstraction, 518 00:52:24,072 --> 00:52:29,780 like one really nice higher level abstraction called modules, on its own, there's sort of less choice involved. 519 00:52:29,780 --> 00:52:36,642 Just stick with nn modules and you'll be good to go. You don't need to worry about which higher level wrapper to use. 520 00:52:37,777 --> 00:52:41,944 So PyTorch tensors, as I said, are just like Numpy arrays, 521 00:52:43,660 --> 00:52:47,787 so here on the right we've done an entire two layer network using entirely PyTorch tensors. 522 00:52:47,787 --> 00:52:53,910 One thing to note is that we're not importing Numpy here at all anymore. We're just doing all these operations using PyTorch tensors. 523 00:52:53,910 --> 00:53:01,245 And this code looks exactly like the two layer net code that you wrote in Numpy on the first homework. 524 00:53:01,245 --> 00:53:07,127 So you set up some random data, you use some operations to compute the forward pass. 525 00:53:07,127 --> 00:53:10,165 And then we're explicitly computing the backward pass ourselves. 526 00:53:10,165 --> 00:53:15,980 Just sort of backpropping through the network, through the operations, just as you did on homework one. 527 00:53:15,980 --> 00:53:22,672 And now we're doing a manual update of the weights using a learning rate and using our computed gradients. 528 00:53:22,672 --> 00:53:27,785 But the major difference between PyTorch tensors and Numpy arrays is that they can run on the GPU, 529 00:53:27,785 --> 00:53:33,034 so all you have to do to make this code run on GPU is use a different data type. 530 00:53:33,034 --> 00:53:42,816 Rather than using torch.FloatTensor, you do torch.cuda.FloatTensor, cast all of your tensors to this new datatype and everything runs magically on the GPU. 531 00:53:43,709 --> 00:53:47,637 You should think of PyTorch tensors as just Numpy plus GPU. 532 00:53:47,637 --> 00:53:50,818 That's exactly what it is, nothing specific to deep learning. 533 00:53:52,638 --> 00:53:55,278 So the next layer of abstraction in PyTorch is the variable. 534 00:53:55,278 --> 00:54:03,460 So once we move from tensors to variables, now we're building computational graphs and we're able to take gradients automatically and everything like that. 535 00:54:03,460 --> 00:54:12,744 So here, if X is a variable, then x.data is a tensor and x.grad is another variable containing the gradients of the loss with respect to that variable. 536 00:54:14,007 --> 00:54:17,246 So x.grad.data is an actual tensor containing those gradients. 537 00:54:18,972 --> 00:54:22,387 And PyTorch tensors and variables have the exact same API.
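A sketch of the variable-and-autograd version of the two layer net might look like this (written against the older PyTorch API where Variable was a separate wrapper type; the sizes and learning rate are made up):

```python
import torch
from torch.autograd import Variable

N, D, H = 64, 1000, 100
x = Variable(torch.randn(N, D), requires_grad=False)
y = Variable(torch.randn(N, D), requires_grad=False)
w1 = Variable(torch.randn(D, H), requires_grad=True)
w2 = Variable(torch.randn(H, D), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass looks just like the tensor version: same API.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass: gradients show up in w1.grad and w2.grad.
    if w1.grad is not None:
        w1.grad.data.zero_()
        w2.grad.data.zero_()
    loss.backward()

    # Manual gradient descent step on the underlying tensors.
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data
```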
538 00:54:22,387 --> 00:54:28,457 So any code that worked on PyTorch tensors you can just make them variables instead and run the same code, 539 00:54:28,457 --> 00:54:34,459 except now you're building up a computational graph rather than just doing these imperative operations. 540 00:54:35,943 --> 00:54:47,461 So here when we create these variables, each call to the variable constructor wraps a PyTorch tensor and then also gives a flag for whether or not we want to compute gradients with respect to that variable. 541 00:54:47,461 --> 00:54:54,073 And now the forward pass looks exactly like it did before in the case with tensors, because variables and tensors have the same API. 542 00:54:54,073 --> 00:54:59,683 So now we're computing our predictions, we're computing our loss in kind of this imperative way. 543 00:54:59,683 --> 00:55:05,251 And then we call loss.backward and now all these gradients come out for us. 544 00:55:05,251 --> 00:55:11,528 And then we can make a gradient update step on our weights using the gradients that are now present in w1.grad.data. 545 00:55:11,528 --> 00:55:18,137 So this ends up looking quite like the Numpy case, except all the gradients come for free. 546 00:55:18,137 --> 00:55:23,353 One thing to note that's kind of different between PyTorch and TensorFlow is that in the TensorFlow case 547 00:55:23,353 --> 00:55:27,132 we were building up this explicit graph, then running the graph many times. 548 00:55:27,132 --> 00:55:32,152 Here in PyTorch, instead we're building up a new graph every time we do a forward pass. 549 00:55:32,152 --> 00:55:37,058 And this makes the code look a bit cleaner. And it has some other implications that we'll get to in a bit. 550 00:55:37,058 --> 00:55:40,630 So in PyTorch you can define your own new autograd functions 551 00:55:40,630 --> 00:55:42,933 by defining the forward and backward in terms of tensors. 552 00:55:42,933 --> 00:55:48,303 This ends up looking kind of like the modular layers code that you write for homework two, 553 00:55:48,303 --> 00:55:54,433 where you implement forward and backward using tensor operations and then stick these things inside a computational graph. 554 00:55:54,433 --> 00:56:00,654 So here we're defining our own relu and then we can actually go in and use our own relu 555 00:56:00,654 --> 00:56:05,214 operation and stick it inside our computational graph, and define our own operations this way. 556 00:56:05,214 --> 00:56:09,097 But most of the time you will probably not need to define your own autograd operations. 557 00:56:09,097 --> 00:56:14,246 Most of the time, the operations you need will already be implemented for you. 558 00:56:14,246 --> 00:56:23,349 So in TensorFlow we saw that we can move to something like Keras or TFLearn, and this gives us a higher level API to work with, rather than these raw computational graphs. 559 00:56:23,349 --> 00:56:30,948 The equivalent in PyTorch is the nn package, which provides these high level wrappers for working with these things. 560 00:56:31,882 --> 00:56:37,772 But unlike TensorFlow there's only one of them. And it works pretty well, so just use that if you're using PyTorch. 561 00:56:37,772 --> 00:56:44,436 So here, this ends up kind of looking like Keras, where we define our model as some sequence of layers, our linear and relu operations. 562 00:56:44,436 --> 00:56:49,816 And we use some loss function defined in the nn package, that's our mean squared error loss.
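A minimal sketch of that nn.Sequential pattern, including the explicit parameter update loop described next, might look like this (the sizes, loss settings, and learning rate are assumptions, and this again follows the older PyTorch API):

```python
import torch
from torch.autograd import Variable

N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Model defined as a sequence of layers, plus a loss function from nn.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(size_average=False)

learning_rate = 1e-4
for t in range(500):
    y_pred = model(x)          # forward pass through the whole stack
    loss = loss_fn(y_pred, y)  # scalar loss

    model.zero_grad()          # clear old gradients
    loss.backward()            # gradients for every parameter, for free

    # Explicit gradient descent step over all parameters of the model.
    for param in model.parameters():
        param.data -= learning_rate * param.grad.data
```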
563 00:56:49,816 --> 00:56:55,214 And now inside each iteration of our loop we can run data forward through the model to get our predictions. 564 00:56:55,214 --> 00:56:59,054 We can run the predictions forward through the loss function to get our scalar loss, 565 00:56:59,054 --> 00:57:04,021 then we can call loss.backward, get all our gradients for free, and then loop over the parameters of the model 566 00:57:04,021 --> 00:57:07,273 and do our explicit gradient descent step to update the model. 567 00:57:07,273 --> 00:57:12,749 And again we see that we're sort of building up this new computational graph every time we do a forward pass. 568 00:57:12,749 --> 00:57:17,017 And just like we saw in TensorFlow, PyTorch provides these optimizer objects 569 00:57:17,017 --> 00:57:23,000 that kind of abstract away this updating logic and implement fancier update rules like Adam and whatnot. 570 00:57:23,000 --> 00:57:28,771 So here we're constructing an optimizer object, telling it that we want it to optimize over the parameters of the model. 571 00:57:28,771 --> 00:57:31,115 Giving it some learning rate and the other hyperparameters. 572 00:57:31,115 --> 00:57:39,810 And now after we compute our gradients we can just call optimizer.step and it updates all the parameters of the model for us right here. 573 00:57:39,810 --> 00:57:44,714 So another common thing you'll do in PyTorch a lot is define your own nn modules. 574 00:57:44,714 --> 00:57:51,801 So typically you'll write your own class which defines your entire model as a single new nn module class. 575 00:57:51,801 --> 00:58:01,043 And a module is just kind of a neural network layer that can contain other modules, or trainable weights, or other kinds of state. 576 00:58:01,043 --> 00:58:07,051 So in this case we can redo the two layer net example by defining our own nn module class. 577 00:58:07,051 --> 00:58:11,672 So now here in the initializer of the class we're assigning this linear1 and linear2. 578 00:58:11,672 --> 00:58:17,257 We're constructing these new module objects and then storing them inside our own class. 579 00:58:17,257 --> 00:58:26,466 And now in the forward pass we can use both our own internal modules as well as arbitrary autograd operations on variables to compute the output of our network. 580 00:58:26,466 --> 00:58:31,594 So here, inside this forward method, we receive the input x as a variable, 581 00:58:31,594 --> 00:58:35,817 then we pass the variable to our self.linear1 for the first layer. 582 00:58:35,817 --> 00:58:38,129 We use the autograd op clamp to compute the relu, 583 00:58:38,129 --> 00:58:42,233 we pass the output of that to the second linear and then that gives us our output. 584 00:58:42,233 --> 00:58:46,633 And now the rest of this code for training this thing looks pretty much the same. 585 00:58:46,633 --> 00:58:54,676 Where we build an optimizer and loop over, and on every iteration feed data to the model, compute the gradients with loss.backward, call optimizer.step. 586 00:58:54,676 --> 00:59:01,817 So this is relatively characteristic of what you might see in a lot of PyTorch type training scenarios. 587 00:59:01,817 --> 00:59:11,166 Where you define your own class, defining your own model that contains other modules and whatnot, and then you have some explicit training loop like this that runs it and updates it. 588 00:59:11,166 --> 00:59:18,873 One kind of nice quality of life thing that you have in PyTorch is a dataloader.
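Before getting to the dataloader, here is a minimal sketch of the custom nn.Module pattern just described (again the older PyTorch API; the sizes, loss, and learning rate are assumptions for illustration):

```python
import torch
from torch.autograd import Variable

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # Child modules: their weights are registered as parameters of this module.
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        # Mix child modules with arbitrary autograd ops (clamp acts as the relu).
        h = self.linear1(x).clamp(min=0)
        return self.linear2(h)

N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

model = TwoLayerNet(D_in, H, D_out)
criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for t in range(500):
    y_pred = model(x)          # forward pass builds a fresh graph
    loss = criterion(y_pred, y)
    optimizer.zero_grad()      # clear old gradients
    loss.backward()            # compute new gradients
    optimizer.step()           # update all model parameters
```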
So a dataloader can handle building minibatches for you. 589 00:59:18,873 --> 00:59:27,273 It can handle some of the multi-threading that we talked about for you, where it can actually use multiple threads in the background to build minibatches for you and stream data off disk. 590 00:59:27,273 --> 00:59:33,221 So here a dataloader wraps a dataset and provides some of these abstractions for you. 591 00:59:33,221 --> 00:59:40,208 And in practice, when you want to run on your own data, you typically will write your own dataset class which knows how to read your particular type of data 592 00:59:40,208 --> 00:59:44,458 off whatever source you want, and then wrap it in a dataloader and train with that. 593 00:59:44,458 --> 00:59:52,233 So here we can see that now we're iterating over the dataloader object and at every iteration this is yielding minibatches of data. 594 00:59:52,233 --> 00:59:58,409 And it's internally handling the shuffling of the data and multithreaded data loading and all this sort of stuff for you. 595 00:59:58,409 --> 01:00:04,161 So this is kind of a complete PyTorch example and a lot of PyTorch training code ends up looking something like this. 596 01:00:05,583 --> 01:00:07,587 PyTorch provides pretrained models. 597 01:00:07,587 --> 01:00:11,521 And this is probably the slickest pretrained model experience I've ever seen. 598 01:00:11,521 --> 01:00:14,268 You just say torchvision.models.alexnet(pretrained=True). 599 01:00:14,268 --> 01:00:18,759 That'll go off in the background and download the pretrained weights for you if you don't already have them, 600 01:00:18,759 --> 01:00:24,242 and then it's right there, you're good to go. So this is super easy to use. 601 01:00:24,242 --> 01:00:27,094 For PyTorch there's also a package called Visdom 602 01:00:27,094 --> 01:00:33,600 that lets you visualize some of these loss statistics, somewhat similar to Tensorboard. 603 01:00:33,600 --> 01:00:38,569 So that's kind of nice. I haven't actually gotten a chance to play around with this myself so I can't really speak to how useful it is, 604 01:00:38,569 --> 01:00:45,907 but one of the major differences between Tensorboard and Visdom is that Tensorboard actually lets you visualize the structure of the computational graph. 605 01:00:45,907 --> 01:00:50,989 Which is really cool, a really useful debugging strategy. And Visdom does not have that functionality yet. 606 01:00:50,989 --> 01:00:54,761 But I've never really used this myself so I can't really speak to its utility. 607 01:00:56,350 --> 01:01:05,491 As a bit of an aside, PyTorch is kind of an evolution of, kind of a newer updated version of, an older framework called Torch which I worked with a lot in the last couple of years. 608 01:01:05,491 --> 01:01:13,280 And I don't want to go through the details here, but PyTorch is pretty much better in a lot of ways than the old Lua Torch, although they actually share a lot 609 01:01:13,280 --> 01:01:18,100 of the same back-end C code for computing with tensors and GPU operations on tensors and whatnot. 610 01:01:18,100 --> 01:01:23,369 So if you look through this Torch example, some of it ends up looking kind of similar to PyTorch, some of it's a bit different. 611 01:01:23,369 --> 01:01:25,957 Maybe you can step through this offline. 612 01:01:25,957 --> 01:01:33,011 But kind of the high level differences between Torch and PyTorch are that Torch is actually in Lua, not Python, unlike these other things. 613 01:01:33,011 --> 01:01:37,748 So learning Lua is a bit of a turn off for some people.
614 01:01:37,748 --> 01:01:40,009 Torch doesn't have autograd. 615 01:01:40,009 --> 01:01:44,324 Torch is also older, so it's more stable, less susceptible to bugs, and there's maybe more example code for Torch. 616 01:01:45,230 --> 01:01:47,214 They're about the same speed, so that's not really a concern. 617 01:01:47,214 --> 01:01:54,531 But PyTorch is in Python, which is great, and you've got autograd, which makes it a lot simpler to write complex models. 618 01:01:54,531 --> 01:01:59,670 In Lua Torch you end up writing a lot of your own backprop code sometimes, so that's a little bit annoying. 619 01:01:59,670 --> 01:02:06,051 But PyTorch is newer, there's less existing code, it's still subject to change. So it's a little bit more of an adventure. 620 01:02:06,051 --> 01:02:17,765 But at least for me, I don't really see much reason for myself to use Torch over PyTorch anymore at this time. So I'm pretty much using PyTorch exclusively for all my work these days. 621 01:02:18,606 --> 01:02:22,531 We've talked a little bit about this idea of static versus dynamic graphs. 622 01:02:22,531 --> 01:02:26,291 And this is one of the main distinguishing features between PyTorch and TensorFlow. 623 01:02:26,291 --> 01:02:38,145 So we saw that in TensorFlow you have these two stages of operation where first you build up this computational graph, then you run the computational graph over and over again many many times, reusing that same graph. 624 01:02:38,145 --> 01:02:42,403 That's called a static computational graph 'cause there's only one of them. 625 01:02:42,403 --> 01:02:48,771 And we saw PyTorch is quite different, where we're actually building up this new computational graph, this new fresh thing, on every forward pass. 626 01:02:48,771 --> 01:02:52,259 That's called a dynamic computational graph. 627 01:02:52,259 --> 01:02:57,053 For kind of simple cases, with kind of feed forward neural networks, it doesn't really make a huge difference, 628 01:02:57,053 --> 01:03:00,225 the code ends up looking kind of similar and they work kind of similarly, 629 01:03:00,225 --> 01:03:07,102 but I do want to talk a bit about some of the implications of static versus dynamic. And what are the tradeoffs of those two. 630 01:03:07,102 --> 01:03:15,286 So one kind of nice idea with static graphs is that because we're kind of building up one computational graph once, and then reusing it many times, 631 01:03:15,286 --> 01:03:19,571 the framework might have the opportunity to go in and do optimizations on that graph. 632 01:03:19,571 --> 01:03:26,809 And kind of fuse some operations, reorder some operations, figure out the most efficient way to execute that graph, so it can be really efficient. 633 01:03:26,809 --> 01:03:33,039 And because we're going to reuse that graph many times, maybe that optimization process is expensive up front, 634 01:03:33,039 --> 01:03:37,230 but we can amortize that cost with the speedups that we get when we run the graph many many times. 635 01:03:37,230 --> 01:03:44,085 So as kind of a concrete example, maybe if you write some graph which has convolution and relu operations kind of one after another, 636 01:03:44,085 --> 01:03:54,530 you might imagine that some fancy graph optimizer could go in and actually emit custom code with fused operations, fusing the convolution 637 01:03:54,530 --> 01:04:03,445 and the relu, so now it's computing the same thing as the code you wrote, but it might be able to be executed more efficiently.
638 01:04:03,445 --> 01:04:10,419 So I'm not too sure exactly what the state of TensorFlow graph optimization is in practice right now, 639 01:04:10,419 --> 01:04:20,131 but at least in principle, this is one place where static graphs really have the potential for this kind of optimization, 640 01:04:20,131 --> 01:04:24,298 where maybe it would be not so tractable for dynamic graphs. 641 01:04:25,504 --> 01:04:28,931 Another kind of subtle point about static versus dynamic is this idea of serialization. 642 01:04:28,931 --> 01:04:34,026 So with a static graph you can imagine that you write this code that builds up the graph, 643 01:04:34,026 --> 01:04:39,571 and then once you've built the graph, you have this data structure in memory that represents the entire structure of your network. 644 01:04:39,571 --> 01:04:42,428 And now you could take that data structure and just serialize it to disk. 645 01:04:42,428 --> 01:04:45,996 And now you've got the whole structure of your network saved in some file. 646 01:04:45,996 --> 01:04:55,450 And then you could later re-load that thing and then run that computational graph without access to the original code that built it. So this would be kind of nice in a deployment scenario. 647 01:04:55,450 --> 01:05:00,424 You might imagine that you might want to train your network in Python because it's maybe easier to work with, 648 01:05:00,424 --> 01:05:07,759 but then you could serialize that network and deploy it in maybe a C++ environment, where you don't need the original code that built the graph. 649 01:05:07,759 --> 01:05:10,909 So that's kind of a nice advantage of static graphs. 650 01:05:10,909 --> 01:05:15,793 Whereas with a dynamic graph, because we're interleaving these processes of graph building and graph execution, 651 01:05:15,793 --> 01:05:22,012 you kind of need the original code at all times if you want to reuse that model in the future. 652 01:05:22,012 --> 01:05:29,163 On the other hand, some advantages for dynamic graphs are that they just make your code a lot cleaner and a lot easier to write in a lot of scenarios. 653 01:05:29,163 --> 01:05:38,624 So for example, suppose that we want to do some conditional operation where, depending on the value of some variable Z, we want to do different operations to compute Y. 654 01:05:39,723 --> 01:05:45,070 Where if Z is positive, we want to use one weight matrix, and if Z is negative we want to use a different weight matrix. 655 01:05:45,070 --> 01:05:47,981 And we just want to switch off between these two alternatives. 656 01:05:47,981 --> 01:05:52,011 In PyTorch, because we're using dynamic graphs, it's super simple. 657 01:05:52,011 --> 01:06:00,795 Your code kind of looks exactly like you would expect, exactly what you would do in Numpy. You can just use normal Python control flow to handle this thing. 658 01:06:00,795 --> 01:06:05,563 And now because we're building up the graph each time, each time we perform this operation we'll take one 659 01:06:05,563 --> 01:06:10,864 of the two paths and build up maybe a different graph on each forward pass, but for any graph that we do 660 01:06:10,864 --> 01:06:14,337 end up building, we can backpropagate through it just fine. 661 01:06:14,337 --> 01:06:15,941 And the code is very clean, easy to work with.
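A sketch of that conditional case in PyTorch might look like this (the names, sizes, and the exact condition are made up for illustration):

```python
import torch
from torch.autograd import Variable

N, D, H = 3, 4, 5
x = Variable(torch.randn(N, D))
w1 = Variable(torch.randn(D, H), requires_grad=True)
w2 = Variable(torch.randn(D, H), requires_grad=True)
z = torch.randn(1)

# Ordinary Python if statement; a different graph gets built depending on z.
if z[0] > 0:
    y = x.mm(w1)
else:
    y = x.mm(w2)

# Whichever path was taken, we can backprop through the graph we actually built.
y.sum().backward()
```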
662 01:06:15,941 --> 01:06:23,201 Now in TensorFlow the situation is a little bit more complicated, because we build the graph once, so 663 01:06:23,201 --> 01:06:28,400 this control flow operator kind of needs to be an explicit operator in the TensorFlow graph. 664 01:06:28,400 --> 01:06:36,818 And now, so then you can see that we have this tf.cond call, which is kind of like a TensorFlow version of an if statement, but now it's baked into 665 01:06:36,818 --> 01:06:40,741 the computational graph rather than using sort of Python control flow. 666 01:06:40,741 --> 01:06:48,729 And the problem is that because we only build the graph once, all the potential paths of control flow that our program might flow through need to be baked 667 01:06:48,729 --> 01:06:52,523 into the graph at the time we construct it, before we ever run it. 668 01:06:52,523 --> 01:07:03,360 So that means that any kind of control flow operators that you want to have need to be not Python control flow operators; you need to use some kind of magic, special TensorFlow operations to do control flow. 669 01:07:03,360 --> 01:07:05,527 In this case this tf.cond. 670 01:07:06,713 --> 01:07:10,763 Another kind of similar situation happens if you want to have loops. 671 01:07:10,763 --> 01:07:19,839 So suppose that we want to compute some kind of recurrence relation, where maybe y_t is equal to y_(t-1) plus x_t times some weight matrix w, and 672 01:07:19,839 --> 01:07:26,436 every time we compute this, we might have a different sized sequence of data. 673 01:07:26,436 --> 01:07:33,371 And no matter the length of our sequence of data, we just want to compute this same recurrence relation, no matter the size of the input sequence. 674 01:07:33,371 --> 01:07:39,489 So in PyTorch this is super easy. We can just kind of use a normal for loop in Python 675 01:07:39,489 --> 01:07:47,095 to just loop over the number of times that we want to unroll, and now depending on the size of the input data, our computational graph will end up a different size, 676 01:07:47,095 --> 01:07:51,694 but that's fine, we can just backpropagate through each one, one at a time. 677 01:07:51,694 --> 01:07:55,782 Now in TensorFlow this becomes a little bit uglier. 678 01:07:55,782 --> 01:08:06,364 And again, because we need to construct the graph all at once up front, this control flow looping construct again needs to be an explicit node in the TensorFlow graph. 679 01:08:06,364 --> 01:08:13,517 So I hope you remember your functional programming, because you'll have to use those kinds of operators to implement looping constructs in TensorFlow. 680 01:08:13,517 --> 01:08:23,024 So in this case, for this particular recurrence relation you can use a foldl operation and, sort of, implement this particular loop in terms of a foldl. 681 01:08:24,100 --> 01:08:28,734 But what this basically means is that you have this sense that TensorFlow is almost building its own entire 682 01:08:28,734 --> 01:08:33,212 programming language, using the language of computational graphs. 683 01:08:33,212 --> 01:08:37,215 And any kind of control flow operator, or any kind of data structure, needs to be rolled 684 01:08:37,215 --> 01:08:44,216 into the computational graph, so you can't really utilize all your favorite paradigms for working imperatively in Python. 685 01:08:44,216 --> 01:08:52,804 You kind of need to relearn a whole separate set of control flow operators if you want to do any kind of control flow inside your computational graph in TensorFlow.
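For contrast, a sketch of that unrolled-loop case on the PyTorch side might look like this (the exact recurrence, sizes, and sequence length here are made up for illustration):

```python
import torch
from torch.autograd import Variable

T, D = 6, 4                      # the sequence length can differ from run to run
x = Variable(torch.randn(T, D))
w = Variable(torch.randn(D, D), requires_grad=True)

y = Variable(torch.zeros(1, D))
for t in range(T):
    # Each iteration adds more nodes to this forward pass's graph:
    # y_t = y_(t-1) + x_t times the weight matrix w.
    y = y + x[t].view(1, D).mm(w)

# The graph is as long as the input sequence, and backprop just works.
y.sum().backward()
```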
686 01:08:52,804 --> 01:08:58,238 So at least for me, I find that kind of confusing, a little bit hard to wrap my head around sometimes, 687 01:08:58,238 --> 01:09:06,722 and I kind of like that, using PyTorch dynamic graphs, you can just use your favorite imperative programming constructs and it all works just fine. 688 01:09:07,737 --> 01:09:21,579 By the way, there actually is some very new library called TensorFlow Fold, which is another one of these layers on top of TensorFlow that lets you implement dynamic graphs: you kind of write your own code 689 01:09:22,416 --> 01:09:32,277 using TensorFlow Fold that looks kind of like a dynamic graph operation, and then TensorFlow Fold does some magic for you and somehow implements that in terms of the static TensorFlow graphs. 690 01:09:32,277 --> 01:09:37,357 This is a super new paper that's being presented at ICLR this week in France. 691 01:09:37,358 --> 01:09:41,694 So I haven't had the chance to dive in and play with this yet. 692 01:09:41,694 --> 01:09:46,455 But my initial impression was that it does add some amount of dynamic graph support to TensorFlow, but it is still 693 01:09:46,455 --> 01:09:51,952 a bit more awkward to work with than the sort of native dynamic graphs you have in PyTorch. 694 01:09:51,952 --> 01:09:57,257 So then, I thought it might be nice to motivate why we would care about dynamic graphs in general. 695 01:09:57,257 --> 01:10:00,257 So one option is recurrent networks. 696 01:10:01,177 --> 01:10:07,612 So you can see that for something like image captioning we use a recurrent network which operates over sequences of different lengths. 697 01:10:07,612 --> 01:10:13,337 In this case, the sentence that we want to generate as a caption is a sequence, and that sequence can vary 698 01:10:13,337 --> 01:10:15,636 depending on our input data. 699 01:10:15,636 --> 01:10:21,694 So now you can see that we have this dynamism in the thing where, depending on the size of the sentence, 700 01:10:21,694 --> 01:10:25,716 our computational graph might need to have more or fewer elements. 701 01:10:25,716 --> 01:10:29,920 So that's one kind of common application of dynamic graphs. 702 01:10:29,920 --> 01:10:36,377 For those of you who took CS224N last quarter, you saw this idea of recursive networks, 703 01:10:36,377 --> 01:10:47,337 where sometimes in natural language processing you might, for example, compute a parse tree of a sentence and then you want to have a neural network kind of operate recursively up this parse tree. 704 01:10:47,337 --> 01:10:56,856 So we have a neural network that's not just a sequential stack of layers, but instead is kind of working over some graph or tree structure, where now each data point 705 01:10:56,856 --> 01:10:58,732 might have a different graph or tree structure, 706 01:10:58,732 --> 01:11:05,714 so the structure of the computational graph then kind of mirrors the structure of the input data. And it could vary from data point to data point. 707 01:11:05,714 --> 01:11:10,316 So this type of thing seems kind of complicated and hairy to implement using TensorFlow, 708 01:11:10,316 --> 01:11:14,887 but in PyTorch you can just kind of use normal Python control flow and it'll work out just fine. 709 01:11:16,574 --> 01:11:23,678 Another bit of a more research application is this really cool idea that I like called neural module networks for visual question answering.
710 01:11:23,678 --> 01:11:31,737 So here the idea is that we want to ask some questions about images, where we maybe input this image of cats and dogs, and there's some question, 711 01:11:31,737 --> 01:11:43,594 what color is the cat, and then internally the system can read the question, and it has these different specialized neural network modules for performing operations like asking for colors and finding cats. 712 01:11:43,594 --> 01:11:49,838 And then depending on the text of the question, it can compile this custom architecture for answering the question. 713 01:11:49,838 --> 01:11:55,094 And now if we asked a different question, like are there more cats than dogs? 714 01:11:55,094 --> 01:12:03,076 Now we have maybe the same basic set of modules for doing things like finding cats and dogs and counting, but they're arranged in a different order. 715 01:12:03,076 --> 01:12:07,716 So we get this dynamism again, where different data points might give rise to different computational graphs. 716 01:12:07,716 --> 01:12:12,574 But this is a bit more of a research thing and maybe not so mainstream right now. 717 01:12:12,574 --> 01:12:19,214 But as kind of a bigger point, I think that there's a lot of cool, creative applications that people could do with dynamic computational graphs, 718 01:12:19,214 --> 01:12:23,471 and maybe there aren't so many right now, just because it's been so painful to work with them. 719 01:12:23,471 --> 01:12:30,596 So I think that there's a lot of opportunity for doing cool, creative things with dynamic computational graphs. 720 01:12:30,596 --> 01:12:34,078 And maybe if you come up with cool ideas, we'll feature it in lecture next year. 721 01:12:34,078 --> 01:12:39,854 So I wanted to talk very briefly about Caffe, which is this framework from Berkeley. 722 01:12:39,854 --> 01:12:48,815 Caffe is somewhat different from the other deep learning frameworks in that, in many cases, you can actually train networks without writing any code yourself. 723 01:12:48,815 --> 01:12:53,214 You kind of just call into these pre-existing binaries, set up some configuration files, and in many cases 724 01:12:53,214 --> 01:12:56,697 you can train on data without writing any of your own code. 725 01:12:56,697 --> 01:13:03,054 So maybe first you convert your data into some format like HDF5 or LMDB, and there exist 726 01:13:03,054 --> 01:13:08,638 some scripts inside Caffe that can just convert folders of images and text files into these formats for you. 727 01:13:08,638 --> 01:13:19,934 Now, instead of writing code to define the structure of your computational graph, you edit some text file called a prototxt which sets up the structure of the computational graph. 728 01:13:19,934 --> 01:13:30,875 Here the structure is that we read from some input HDF5 file, we perform some inner product, we compute some loss, and the whole structure of the graph is set up in this text file. 729 01:13:30,875 --> 01:13:35,956 One kind of downside here is that these files can get really ugly for very large networks. 730 01:13:35,956 --> 01:13:44,253 So for something like the 152 layer ResNet model, which by the way was trained in Caffe originally, this prototxt file ends up almost 7000 lines long. 731 01:13:44,253 --> 01:13:51,817 So people are not writing these by hand. People will sometimes write Python scripts to generate these prototxt files.
732 01:13:51,817 --> 01:13:53,275 [laughter] 733 01:13:53,275 --> 01:13:58,974 Then you're kind of in the realm of rolling your own computational graph abstraction. That's probably not a good idea, but I've seen that before. 734 01:13:58,974 --> 01:14:07,497 Then, rather than having some optimizer object, there's a solver; you define the solver settings inside another prototxt. 735 01:14:07,497 --> 01:14:11,036 This defines your learning rate, your optimization algorithm and whatnot. 736 01:14:11,036 --> 01:14:17,278 And then once you do all these things, you can just run the Caffe binary with the train command and it all happens magically. 737 01:14:17,278 --> 01:14:21,294 Caffe has a model zoo with a bunch of pretrained models, that's pretty useful. 738 01:14:21,294 --> 01:14:25,438 Caffe has a Python interface, but it's not super well documented. 739 01:14:25,438 --> 01:14:31,455 You kind of need to read the source code of the Python interface to see what it can do, so that's kind of annoying. But it does work. 740 01:14:31,455 --> 01:14:40,174 So, kind of my general take on Caffe is that it's maybe good for feed forward models, and it's maybe good for production scenarios, 741 01:14:40,174 --> 01:14:42,796 because it doesn't depend on Python. 742 01:14:42,796 --> 01:14:47,358 But for research these days, I've seen Caffe being used maybe a little bit less. 743 01:14:47,358 --> 01:14:51,417 Although I think it is still pretty commonly used in industry, again for production. 744 01:14:51,417 --> 01:14:54,410 I promise, just one or two slides on Caffe 2. 745 01:14:54,410 --> 01:14:58,596 So Caffe 2 is the successor to Caffe, which is from Facebook. 746 01:14:58,596 --> 01:15:02,432 It's super new, it was only released a week ago. 747 01:15:02,432 --> 01:15:04,436 [laughter] 748 01:15:04,436 --> 01:15:09,314 So I really haven't had the time to form a super educated opinion about Caffe 2 yet, 749 01:15:09,314 --> 01:15:12,318 but it uses static graphs, kind of similar to TensorFlow. 750 01:15:12,318 --> 01:15:17,817 Kind of like Caffe 1, the core is written in C++ and they have some Python interface. 751 01:15:17,817 --> 01:15:21,518 The difference is that now you no longer need to write your own Python scripts to generate prototxt files. 752 01:15:21,518 --> 01:15:29,657 You can kind of define your computational graph structure all in Python, with an API that looks kind of like TensorFlow. 753 01:15:29,657 --> 01:15:34,596 But then you can spit out, you can serialize, this computational graph structure to a prototxt file. 754 01:15:34,596 --> 01:15:38,676 And then once your model is trained and whatnot, then we get this benefit that we talked about of static 755 01:15:38,676 --> 01:15:43,534 graphs, where you don't need the original training code now in order to deploy a trained model. 756 01:15:43,534 --> 01:15:49,417 So one interesting thing is that you've seen Google maybe has one major deep learning framework, 757 01:15:49,417 --> 01:15:53,761 which is TensorFlow, where Facebook has these two, PyTorch and Caffe 2. 758 01:15:54,596 --> 01:15:57,252 So these are kind of different philosophies. 759 01:15:57,252 --> 01:16:02,847 Google's kind of trying to build one framework to rule them all that maybe works for every possible scenario for deep learning. 760 01:16:02,847 --> 01:16:07,852 This is kind of nice because it consolidates all efforts onto one framework.
It means you only need to learn one thing 761 01:16:07,852 --> 01:16:13,772 and it'll work across many different scenarios, including distributed systems, production deployment, mobile, research, everything. 762 01:16:13,772 --> 01:16:15,706 You only need to learn one framework to do all these things. 763 01:16:15,706 --> 01:16:18,151 Whereas Facebook is taking a bit of a different approach. 764 01:16:18,151 --> 01:16:26,071 Where PyTorch is really more specialized, more geared towards research, so in terms of writing research code and quickly iterating on your ideas, 765 01:16:26,071 --> 01:16:32,951 that's super easy in PyTorch, but for things like running in production, running on mobile devices, PyTorch doesn't have a lot of great support. 766 01:16:32,951 --> 01:16:37,710 Instead, Caffe 2 is kind of geared toward those more production oriented use cases. 767 01:16:39,567 --> 01:16:47,350 So my kind of general, overall advice about which framework to use for which problems is this: 768 01:16:47,350 --> 01:16:53,510 I think TensorFlow is a pretty safe bet for just about any project that you want to start new, right? 769 01:16:53,510 --> 01:16:58,849 Because it is sort of one framework to rule them all, it can be used for just about any circumstance. 770 01:16:58,849 --> 01:17:05,207 However, you probably need to pair it with a higher level wrapper, and if you want dynamic graphs, you're maybe out of luck. 771 01:17:05,207 --> 01:17:13,190 Some of the code ends up looking a little bit uglier in my opinion, but maybe that's kind of a cosmetic detail and it doesn't really matter that much. 772 01:17:13,190 --> 01:17:15,809 I personally think PyTorch is really great for research. 773 01:17:15,809 --> 01:17:21,233 If you're focused on just writing research code, I think PyTorch is a great choice. 774 01:17:21,233 --> 01:17:25,649 But it's a bit newer, has less community support, less code out there, so it could be a bit of an adventure. 775 01:17:25,649 --> 01:17:29,969 If you want more of a well trodden path, TensorFlow might be a better choice. 776 01:17:29,969 --> 01:17:34,710 If you're interested in production deployment, you should probably look at Caffe, Caffe 2 or TensorFlow. 777 01:17:34,710 --> 01:17:41,270 And if you're really focused on mobile deployment, I think TensorFlow and Caffe 2 both have some built in support for that. 778 01:17:41,270 --> 01:17:47,393 So unfortunately there's not just one global best framework; it kind of depends on what you're actually trying to do, 779 01:17:47,393 --> 01:17:52,045 and what applications you anticipate, but this is kind of my general advice on those things. 780 01:17:53,169 --> 01:17:55,691 So next time we'll talk about some case studies of various CNN architectures.